Verified Commit 7092d145 authored by anarcat

prometheus: expand the metrics and storage section

I was trying to show how much disk space Prometheus was using and
couldn't find good answers.
parent bc7996b4
@@ -46,6 +46,7 @@ follow the [training course](#training-course-plan) or see the [web dashboards s
- Ensuring the tags required for routing are there
- Link to prom graphs from prom's alert page
[TPA-RFC-33]: policy/tpa-rfc-33-monitoring
[Alert debugging]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#alert-debugging
[Queries cheat sheet]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#queries-cheat-sheet
[Adding alerts]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#writing-an-alert
@@ -2176,7 +2177,7 @@ Much of the initial Prometheus configuration was also documented in
storage requirements and possible alternatives for data retention
policies.
[ticket 29388]: https://bugs.torproject.org/29388
[ticket 29388]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/29388
[ticket 29681]: https://bugs.torproject.org/29681
[use of Debian packages for installation]: https://github.com/voxpupuli/puppet-prometheus/pull/303
[allow scrape job collection]: https://github.com/voxpupuli/puppet-prometheus/pull/304
@@ -2660,7 +2661,46 @@ There's also a [list of third-party exporters][] in the Prometheus documentation
## Storage
<!-- TODO databases? plain text file? the frigging blockchain? memory? -->
<a name="long-term-metrics-storage" />
Prometheus stores data in its own custom "time-series database"
(TSDB).
Metrics are held for about a year or less, depending on the
server. Look at [this dashboard for current disk usage of the
Prometheus servers](https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&from=now-1y&to=now&timezone=utc&var-class=$__all&var-instance=hetzner-nbg1-01.torproject.org&var-instance=hetzner-nbg1-02.torproject.org&var-Filters&refresh=auto).
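Alternatively, the server's own TSDB metrics can answer the same question. Here is a minimal sketch, assuming the standard Prometheus HTTP API is reachable and using the built-in `prometheus_tsdb_storage_blocks_bytes` self-metric; the server URL is a placeholder to adjust:

```python
# Sketch: ask a Prometheus server how much disk its TSDB blocks occupy.
# Assumes the standard /api/v1/query endpoint is reachable; the URL below
# is a placeholder, not the actual TPA server address.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://localhost:9090"  # placeholder, point at the real server
params = urllib.parse.urlencode({"query": "prometheus_tsdb_storage_blocks_bytes"})

with urllib.request.urlopen(f"{PROMETHEUS}/api/v1/query?{params}") as resp:
    data = json.load(resp)

for result in data["data"]["result"]:
    # each value is a [timestamp, "value-as-string"] pair; report in GiB
    size = float(result["value"][1])
    print(f"{result['metric'].get('instance', '?')}: {size / 2**30:.1f} GiB")
```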
The actual disk usage depends on:
- `N`: the number of exporters
- `X`: the number of metrics they expose
- 1.3 bytes: the size of a sample
- `P`: the retention period (currently 1 year)
- `I`: scrape interval (currently one minute)
The formula to compute disk usage is this:
```
N x X x 1.3 bytes x P / I
```
For example, in [ticket 29388][], we computed that a simple node
exporter setup with 2500 metrics per node across 80 nodes ends up
using roughly 127GiB (about 137GB) of disk:
```
> 1.3byte/minute * year * 2500 * 80 to Gibyte
(1,3 * (byte / minute)) * year * 2500 * 80 = approx. 127,35799 gibibytes
```
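For convenience, here is the same arithmetic as a small Python sketch; nothing Prometheus-specific, just the formula above plugged with those numbers:

```python
# Reproduce the estimate above: 80 node exporters, 2500 metrics each,
# one sample per minute, 1.3 bytes per sample, kept for one year.
BYTES_PER_SAMPLE = 1.3
exporters = 80
metrics_per_exporter = 2500
scrape_interval_s = 60
retention_s = 365 * 24 * 3600  # one year

samples_per_series = retention_s / scrape_interval_s
total = exporters * metrics_per_exporter * samples_per_series * BYTES_PER_SAMPLE
print(f"{total / 2**30:.1f} GiB")  # ~127 GiB, matching the calculation above
```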
Back then, we configured Prometheus to keep only 30 days of samples,
but that proved to be insufficient for many cases, so it was raised to
one year in 2020, in [issue 31244](https://gitlab.torproject.org/tpo/tpa/team/-/issues/31244).
The [retention section of TPA-RFC-33](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-33-monitoring#retention) has a detailed
discussion of retention periods. We're considering [multi-year
retention periods](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40330) for the future.
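Because the formula is linear in the retention period, a rough, illustrative extrapolation of what multi-year retention would mean for the same hypothetical 80-node setup looks like this:

```python
# Rough extrapolation: disk usage grows linearly with the retention period.
one_year_gib = 127.4  # the one-year estimate computed above
for years in (1, 2, 5):
    print(f"{years} year(s) of retention: ~{one_year_gib * years:.0f} GiB")
```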
## Queues
@@ -2851,20 +2891,17 @@ details.
## Monitoring and metrics
Prometheus is, of course, all about monitoring and metrics. It
is the thing that monitors everything and keeps metrics over the long
term.
The server monitors itself, both for system-level metrics and for
application-specific metrics. There's a long-term plan for
high availability in [TPA-RFC-33-C][].
[TPA-RFC-33-C]: https://gitlab.torproject.org/groups/tpo/tpa/-/milestones/15
Metrics are held for about a year or less, depending on the server,
see [ticket 29388][] for storage requirements and possible
alternatives for data retention policies.
Note that [TPA-RFC-33][] also discusses alternative metrics retention
policies.
[TPA-RFC-33]: policy/tpa-rfc-33-monitoring
See also [storage](#storage) for retention policies.
## Tests