From ebb4fc90ed72a75426a784b4ee4f937d15ab6d38 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?= <anarcat@debian.org>
Date: Mon, 23 Aug 2021 17:03:33 -0400
Subject: [PATCH] long term prometheus metrics storage

---
 howto/prometheus.md | 55 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/howto/prometheus.md b/howto/prometheus.md
index b7dd01b9..4098c33a 100644
--- a/howto/prometheus.md
+++ b/howto/prometheus.md
@@ -823,6 +823,61 @@
 would still be able to deduce some activity patterns from the metrics
 generated by Prometheus, and use it to leverage side-channel attacks,
 which is why the external Prometheus server access is restricted.
 
+### Long term metrics storage
+
+Metrics are held for about a year or less, depending on the server;
+see [ticket 29388][] for storage requirements and possible
+alternatives for data retention policies.
+
+Note that extra long-term data retention might be possible [using
+the remote read functionality](https://www.robustperception.io/looking-beyond-retention), which enables the primary server to
+read metrics from a secondary, longer-term server transparently,
+keeping graphs working without having to change data source, for
+example.
+
+That way you could have a short-term server which polls lots of
+metrics every minute or even every 15 seconds but keeps (say) only 30
+days of data, and a long-term server which would poll the short-term
+server every (say) 5 minutes but keep (say) 5 years of metrics. But
+how much data would that be?
+
+The [last time we made an estimate, in May 2020](https://gitlab.torproject.org/tpo/tpa/team/-/issues/31244#note_2541965), we had the
+following calculation for a 1 minute polling interval over a year:
+
+```
+> 365d×1.3byte/(1min)×2000×78 to Gibyte
+99,271238 gibibytes
+```
+
+At the time of writing (August 2021), that is still the configured
+interval, and the disk usage roughly matches that (98GB used). This
+implies that we could store about 5 years of metrics with a 5 minute
+polling interval, using the same disk usage, obviously:
+
+```
+> 5*365d×1.3byte/(5min)×2000×78 to Gibyte
+99,271238 gibibytes
+```
+
+... or 15 years with 15 minutes, etc. As a rule of thumb, if we
+multiply the scrape interval by some factor, we can multiply the
+retention period by the same factor for the same disk usage.
+
+On the other hand, we might be able to increase granularity quite a
+bit by lowering the retention to (say) 30 days and a 5 second polling
+interval, which would give us:
+
+```
+> 30d*1.3byte/(5 second)*2000*78 to Gibyte
+97,911358 gibibytes
+```
+
+That might be a bit aggressive though: the default Prometheus
+`scrape_interval` is 15 seconds, not 5 seconds... With the defaults
+(15 seconds scrape interval, 30 days retention), we'd be at about
+30GiB disk usage, which makes for a quite reasonable and
+easy-to-replicate primary server.
+
 ## Backups
 
 Prometheus servers should be fully configured through Puppet and
-- 
GitLab
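The disk-usage estimates in the patch above can be reproduced with a short script. This is a sketch using the same assumptions as the text: roughly 1.3 bytes per sample on disk, and 2000 series per target across 78 targets (the `disk_usage_gib` helper and constant names are ours, not part of the patch):

```python
# Rough Prometheus disk-usage estimate, following the arithmetic in the
# patch: samples kept = retention / scrape interval, at ~1.3 bytes per
# sample, for 2000 series per target across 78 targets (assumptions
# taken from the text above).
BYTES_PER_SAMPLE = 1.3
SERIES = 2000 * 78
GIB = 2 ** 30
DAY = 86400  # seconds

def disk_usage_gib(retention_seconds, scrape_interval_seconds):
    """Estimated on-disk size (GiB) for a retention and scrape interval."""
    samples_per_series = retention_seconds / scrape_interval_seconds
    return samples_per_series * BYTES_PER_SAMPLE * SERIES / GIB

# 1 year of 1 minute samples (the configuration at the time of writing)
print(round(disk_usage_gib(365 * DAY, 60), 1))       # ~99.3 GiB
# 5 years of 5 minute samples (a hypothetical long-term server)
print(round(disk_usage_gib(5 * 365 * DAY, 300), 1))  # ~99.3 GiB
# 30 days of 15 second samples (the Prometheus defaults)
print(round(disk_usage_gib(30 * DAY, 15), 1))        # ~32.6 GiB
```

The first two results are identical because scaling the scrape interval and the retention period by the same factor leaves the sample count, and therefore the disk usage, unchanged; the last result matches the "about 30GiB" figure quoted for the defaults.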