workaround is simply to restart the Pushgateway (and clear the
storage, if persistence is enabled, see the `--persistence.file`
flag).

### Running out of disk space

In [tpo/tpa/team#41070](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41070), we encountered a situation where disk
usage on the main Prometheus server was growing linearly even though the
number of targets didn't change. This is a typical problem in time
series databases, where the "cardinality" of metrics grows without
bound, consuming more and more disk space as time goes by.
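
Before digging further, it can help to watch the overall series count Prometheus keeps in its head block. A minimal sketch, using Prometheus's own TSDB metrics in the expression browser (metric names assume a stock 2.x server scraping itself):

```
# total number of active series in the head block
prometheus_tsdb_head_series

# approximate series growth over the last week
delta(prometheus_tsdb_head_series[1w])
```

If the first query trends steadily upward while the target list is stable, cardinality growth is the likely culprit.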

The first step is to confirm the diagnosis by looking at the [Grafana
graph showing Prometheus disk usage](https://grafana.torproject.org/d/000000012/prometheus-2-0-stats?orgId=1&refresh=1m&viewPanel=40&from=now-1y&to=now) over time. This should show a
"sawtooth" pattern where compactions happen regularly (about once
every three weeks) without the baseline growing much over longer
periods of time. In the above ticket, the usage was growing despite
compactions. There are also shorter-term (~4h), smaller compactions
happening. This information is also available in the normal [disk
usage graphic](https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-class=All&var-instance=hetzner-nbg1-01.torproject.org&from=now-3d&to=now&viewPanel=2).
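
The graphs above can also be cross-checked against Prometheus's own storage metrics. A sketch, assuming the default TSDB metrics are being scraped (as they are when Prometheus scrapes itself):

```
# bytes used by persisted blocks on disk
prometheus_tsdb_storage_blocks_bytes

# size of the write-ahead log, truncated during compactions
prometheus_tsdb_wal_storage_size_bytes
```

The sawtooth pattern should be visible in the WAL size, while the blocks metric tracks the long-term trend.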

We then headed for the self-diagnostics Prometheus provides at:

<https://prometheus.torproject.org/classic/status>

The "Most Common Label Pairs" section shows which `job` is
responsible for the largest number of metrics. It should be `job=node`,
as that collects a lot of information for *all* the machines managed
by TPA. About 100k pairs is expected there.

It's also expected that the top "Highest Cardinality Labels" entry will
be `__name__`, at around 1600 entries.
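
The same numbers can be reproduced with ad hoc queries in the expression browser, which is useful for pinpointing exactly which metric or job is responsible. A sketch (these queries scan every series, so they can be expensive on a busy server):

```
# the ten metric names with the most series
topk(10, count by (__name__) ({__name__=~".+"}))

# number of series per scrape job
sort_desc(count by (job) ({__name__=~".+"}))
```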

We haven't implemented it yet, but the [upstream Storage
documentation](https://prometheus.io/docs/prometheus/1.8/storage/) has some interesting tips, including [advice on
long-term storage](https://prometheus.io/docs/prometheus/1.8/storage/#settings-for-very-long-retention-time) which suggests tweaking the
`storage.local.series-file-shrink-ratio`. Note that those documents
cover the legacy 1.x storage engine, so the flags may not apply to
modern (2.x and later) Prometheus releases.

[This guide](https://alexandre-vazquez.com/how-it-optimize-the-disk-usage-in-the-prometheus-database/) also had some useful queries and tips we didn't fully
investigate.

## Disaster recovery

If a Prometheus/Grafana server is destroyed, it should be completely