From ebb4fc90ed72a75426a784b4ee4f937d15ab6d38 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?= <anarcat@debian.org>
Date: Mon, 23 Aug 2021 17:03:33 -0400
Subject: [PATCH] long term prometheus metrics storage

---
 howto/prometheus.md | 55 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/howto/prometheus.md b/howto/prometheus.md
index b7dd01b9..4098c33a 100644
--- a/howto/prometheus.md
+++ b/howto/prometheus.md
@@ -823,6 +823,61 @@
 would still be able to deduce some activity patterns from the metrics
 generated by Prometheus, and use it to leverage side-channel attacks,
 which is why the external Prometheus server access is restricted.
 
+### Long term metrics storage
+
+Metrics are held for about a year or less, depending on the server;
+see [ticket 29388][] for storage requirements and possible
+alternatives for data retention policies.
+
+Note that extra long-term data retention might be possible [using
+the remote read functionality](https://www.robustperception.io/looking-beyond-retention), which enables the primary server to
+read metrics from a secondary, longer-term server transparently,
+keeping graphs working without having to change data source, for
+example.
+
+That way you could have a short-term server which polls lots of
+metrics every minute or even every 15 seconds but keeps (say) only 30
+days of data, and a long-term server which would poll the short-term
+server every (say) 5 minutes but keep (say) 5 years of metrics. But
+how much data would that be?
+
+The [last time we made an estimate, in May 2020](https://gitlab.torproject.org/tpo/tpa/team/-/issues/31244#note_2541965), we had the
+following calculation for a 1 minute polling interval over a year:
+
+```
+> 365d×1.3byte/(1min)×2000×78 to Gibyte
+99,271238 gibibytes
+```
+
+At the time of writing (August 2021), that is still the configured
+interval, and the disk usage roughly matches that (98GB used). This
+implies that we could store about 5 years of metrics with a 5 minute
+polling interval, using the same disk usage, obviously:
+
+```
+> 5*365d×1.3byte/(5min)×2000×78 to Gibyte
+99,271238 gibibytes
+```
+
+... or 15 years with 15 minutes, etc. As a rule of thumb, if we
+multiply the scrape interval by some factor, we can multiply the
+retention period by the same factor for the same disk usage.
+
+On the other hand, we might be able to increase granularity quite a
+bit by lowering the retention to (say) 30 days and a 5 second polling
+interval, which would give us:
+
+```
+> 30d*1.3byte/(5 second)*2000*78 to Gibyte
+97,911358 gibibytes
+```
+
+That might be a bit aggressive though: the default Prometheus
+`scrape_interval` is 15 seconds, not 5 seconds... With the defaults
+(15 seconds scrape interval, 30 days retention), we'd be at about
+30GiB disk usage, which makes for a quite reasonable and
+easy-to-replicate primary server.
+
 ## Backups
 
 Prometheus servers should be fully configured through Puppet and
-- 
GitLab
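The disk-usage estimates in the patch above can be reproduced with a short script. This is a sketch using the same assumptions as the text: roughly 1.3 bytes per sample on disk, and 2000 series per target across 78 targets (the `disk_usage_gib` helper and constant names are ours, not part of the patch):

```python
# Rough Prometheus disk-usage estimate, following the arithmetic in the
# patch: samples kept = retention / scrape interval, at ~1.3 bytes per
# sample, for 2000 series per target across 78 targets (assumptions
# taken from the text above).
BYTES_PER_SAMPLE = 1.3
SERIES = 2000 * 78
GIB = 2 ** 30
DAY = 86400  # seconds

def disk_usage_gib(retention_seconds, scrape_interval_seconds):
    """Estimated on-disk size (GiB) for a retention and scrape interval."""
    samples_per_series = retention_seconds / scrape_interval_seconds
    return samples_per_series * BYTES_PER_SAMPLE * SERIES / GIB

# 1 year of 1 minute samples (the configuration at the time of writing)
print(round(disk_usage_gib(365 * DAY, 60), 1))       # ~99.3 GiB
# 5 years of 5 minute samples (a hypothetical long-term server)
print(round(disk_usage_gib(5 * 365 * DAY, 300), 1))  # ~99.3 GiB
# 30 days of 15 second samples (the Prometheus defaults)
print(round(disk_usage_gib(30 * DAY, 15), 1))        # ~32.6 GiB
```

The first two results are identical because scaling the scrape interval and the retention period by the same factor leaves the sample count, and therefore the disk usage, unchanged; the last result matches the "about 30GiB" figure quoted for the defaults.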