Verified Commit ebb4fc90 authored by anarcat

long term prometheus metrics storage

parent 14d30eee
@@ -823,6 +823,61 @@ would still be able to deduce some activity patterns from the metrics
generated by Prometheus, and use it to leverage side-channel attacks,
which is why the external Prometheus server access is restricted.

### Long term metrics storage

Metrics are held for about a year or less, depending on the server;
see [ticket 29388][] for storage requirements and possible
alternatives for data retention policies.

Note that extra long-term data retention might be possible [using
the remote read functionality](https://www.robustperception.io/looking-beyond-retention), which enables the primary server to
transparently read metrics from a secondary, longer-term server,
keeping graphs working without having to change the data source, for
example.
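
As a sketch, remote read on the short-term (primary) server might
look like the following; the long-term server hostname is
hypothetical, but `remote_read` and `read_recent` are standard
Prometheus configuration:

```
# prometheus.yml on the short-term (primary) server -- a sketch,
# the long-term server name is hypothetical
remote_read:
  - url: "http://prometheus-longterm.example.org:9090/api/v1/read"
    # only consult the long-term server for data older than what
    # is stored locally
    read_recent: false
```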

That way you could have a short-term server which polls every minute
or even every 15 seconds but keeps (say) only 30 days of data, and a
long-term server which would poll the short-term server every (say)
5 minutes but keep (say) 5 years of metrics. But how much data would
that be?

The [last time we made an estimate, in May 2020](https://gitlab.torproject.org/tpo/tpa/team/-/issues/31244#note_2541965), we had the
following calculation for 1 minute polling interval over a year:
```
> 365d×1.3byte/(1min)×2000×78 to Gibyte
99,271238 gibibytes
```
At the time of writing (August 2021), that is still the configured
interval, and the disk usage roughly matches that (98GB used). This
implies that we could store about 5 years of metrics with a 5 minute
polling interval, using the same disk usage, obviously:
```
> 5*365d×1.3byte/(5min)×2000×78 to Gibyte
99,271238 gibibytes
```

... or 15 years with a 15 minute interval, and so on. As a rule of
thumb, if we multiply the scrape interval by some factor, we can
multiply the retention period by the same factor while keeping disk
usage constant.
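
The rule of thumb can be checked with a short calculation. This is a
sketch: the figures of 1.3 bytes per sample, 2000 series per target
and 78 targets come from the estimate above, and the function name is
ours:

```python
def storage_gib(retention_days, interval_seconds,
                bytes_per_sample=1.3, series_per_target=2000, targets=78):
    """Rough TSDB disk usage: one sample per series per scrape."""
    samples_per_series = retention_days * 86400 / interval_seconds
    total = samples_per_series * bytes_per_sample * series_per_target * targets
    return total / 2**30  # bytes to GiB

print(round(storage_gib(365, 60), 2))       # 1 year at 1 minute -> 99.27
print(round(storage_gib(5 * 365, 300), 2))  # 5 years at 5 minutes -> 99.27
print(round(storage_gib(30, 5), 2))         # 30 days at 5 seconds -> 97.91
```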

On the other hand, we might be able to increase granularity quite a
bit by lowering the retention to (say) 30 days and the polling
interval to 5 seconds, which would give us:
```
> 30d*1.3byte/(5 second)*2000*78 to Gibyte
97,911358 gibibytes
```

That might be a bit aggressive though: the default Prometheus
`scrape_interval` is 15 seconds, not 5 seconds... With the defaults
(15 seconds scrape interval, 30 days retention), we'd be at about
30GiB of disk usage, which makes for a quite reasonable and
easy-to-replicate primary server.
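
Concretely, those defaults could be made explicit as follows. This is
a sketch (the config path is illustrative), but `scrape_interval` and
`--storage.tsdb.retention.time` are standard Prometheus 2.x settings:

```
# prometheus.yml
global:
  scrape_interval: 15s

# command line
prometheus --config.file=/etc/prometheus/prometheus.yml \
           --storage.tsdb.retention.time=30d
```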

## Backups

Prometheus servers should be fully configured through Puppet and
......