workaround is simply to restart the Pushgateway (and clear the
storage, if persistence is enabled, see the `--persistence.file`
flag).

### Running out of disk space

In [tpo/tpa/team#41070](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41070), we encountered a situation where disk
usage on the main Prometheus server was growing linearly even though the
number of targets didn't change. This is a typical problem in time
series databases, where the "cardinality" of metrics grows without
bound, consuming more and more disk space as time goes by.
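
Before digging further, it can help to watch the overall series count Prometheus keeps in its head block. A minimal sketch, using Prometheus's own TSDB metrics in the expression browser (metric names assume a stock 2.x server scraping itself):

```
# total number of active series in the head block
prometheus_tsdb_head_series

# approximate series growth over the last week
delta(prometheus_tsdb_head_series[1w])
```

If the first query trends steadily upward while the target list is stable, cardinality growth is the likely culprit.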

The first step is to confirm the diagnosis by looking at the [Grafana
graph showing Prometheus disk usage](https://grafana.torproject.org/d/000000012/prometheus-2-0-stats?orgId=1&refresh=1m&viewPanel=40&from=now-1y&to=now) over time. This should show a
"sawtooth" pattern where compactions happen regularly (about once
every three weeks) without the baseline growing much over longer
periods of time. In the above ticket, the usage was growing despite
compactions. There are also shorter-term (~4h), smaller compactions
happening. This information is also available in the normal [disk
usage graphic](https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-class=All&var-instance=hetzner-nbg1-01.torproject.org&from=now-3d&to=now&viewPanel=2).
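
The graphs above can also be cross-checked against Prometheus's own storage metrics. A sketch, assuming the default TSDB metrics are being scraped (as they are when Prometheus scrapes itself):

```
# bytes used by persisted blocks on disk
prometheus_tsdb_storage_blocks_bytes

# size of the write-ahead log, truncated during compactions
prometheus_tsdb_wal_storage_size_bytes
```

The sawtooth pattern should be visible in the WAL size, while the blocks metric tracks the long-term trend.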

We then headed for the self-diagnostics Prometheus provides at:

<https://prometheus.torproject.org/classic/status>

The "Most Common Label Pairs" section shows which `job` is
responsible for the largest number of metrics. It should be `job=node`,
as that collects a lot of information for *all* the machines managed
by TPA. About 100k pairs is expected there.

It's also expected that the top "Highest Cardinality Labels" entry will
be `__name__`, at around 1600 entries.
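
The same numbers can be reproduced with ad hoc queries in the expression browser, which is useful for pinpointing exactly which metric or job is responsible. A sketch (these queries scan every series, so they can be expensive on a busy server):

```
# the ten metric names with the most series
topk(10, count by (__name__) ({__name__=~".+"}))

# number of series per scrape job
sort_desc(count by (job) ({__name__=~".+"}))
```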

We haven't implemented it yet, but the [upstream Storage
documentation](https://prometheus.io/docs/prometheus/1.8/storage/) has some interesting tips, including [advice on
long-term storage](https://prometheus.io/docs/prometheus/1.8/storage/#settings-for-very-long-retention-time) which suggests tweaking the
`storage.local.series-file-shrink-ratio`. Note that those documents
cover the legacy 1.x storage engine, so the flags may not apply to
modern (2.x and later) Prometheus releases.

[This guide](https://alexandre-vazquez.com/how-it-optimize-the-disk-usage-in-the-prometheus-database/) also had some useful queries and tips we didn't fully
investigate.

## Disaster recovery

If a Prometheus/Grafana server is destroyed, it should be completely