prometheus: expand the metrics and storage section (authored by anarcat)
I was trying to show how much disk space Prometheus was using and
couldn't find good answers.
......@@ -46,6 +46,7 @@ follow the [training course](#training-course-plan) or see the [web dashboards s
- Ensuring the tags required for routing are there
- Link to prom graphs from prom's alert page
[TPA-RFC-33]: policy/tpa-rfc-33-monitoring
[Alert debugging]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#alert-debugging
[Queries cheat sheet]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#queries-cheat-sheet
[Adding alerts]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#writing-an-alert
......@@ -2176,7 +2177,7 @@ Much of the initial Prometheus configuration was also documented in
storage requirements and possible alternatives for data retention
policies.
[ticket 29388]: https://bugs.torproject.org/29388
[ticket 29388]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/29388
[ticket 29681]: https://bugs.torproject.org/29681
[use of Debian packages for installation]: https://github.com/voxpupuli/puppet-prometheus/pull/303
[allow scrape job collection]: https://github.com/voxpupuli/puppet-prometheus/pull/304
......@@ -2660,7 +2661,46 @@ There's also a [list of third-party exporters][] in the Prometheus documentation
## Storage
<!-- TODO databases? plain text file? the frigging blockchain? memory? -->
<a name="long-term-metrics-storage" />
Prometheus stores data in its own custom "time-series database"
(TSDB).
Metrics are held for about a year or less, depending on the
server. Look at [this dashboard for current disk usage of the
Prometheus servers](https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&from=now-1y&to=now&timezone=utc&var-class=$__all&var-instance=hetzner-nbg1-01.torproject.org&var-instance=hetzner-nbg1-02.torproject.org&var-Filters&refresh=auto).
The actual disk usage depends on:
- `N`: the number of exporters
- `X`: the number of metrics each exporter exposes
- 1.3 bytes: the size of a sample
- `P`: the retention period (currently 1 year)
- `I`: scrape interval (currently one minute)
The formula to compute disk usage is this:
```
N x X x 1.3 bytes x P / I
```
For example, in [ticket 29388][], we computed that a simple node
exporter setup exposing 2500 metrics on each of 80 nodes ends up
using about 127GiB of disk over one year:
```
> 1.3byte/minute * year * 2500 * 80 to Gibyte
(1,3 * (byte / minute)) * year * 2500 * 80 = approx. 127,35799 gibibytes
```
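The same estimate can be sketched in Python, which makes it easier to
try other retention periods or scrape intervals. This is only a rough
sketch of the formula above, not something Prometheus ships: the
function and constant names are made up for illustration, and the year
length (365.25 days) is an assumption chosen to match the calculator
output above.

```python
# Rough disk usage estimate: N x X x 1.3 bytes x P / I
# All names here are illustrative, not part of Prometheus.

SAMPLE_BYTES = 1.3                    # size of one sample, per the list above
MINUTES_PER_YEAR = 365.25 * 24 * 60   # assumption: 365.25-day year


def estimate_disk_bytes(exporters, metrics_per_exporter,
                        retention_minutes, scrape_interval_minutes=1.0,
                        sample_bytes=SAMPLE_BYTES):
    """Return the estimated TSDB size in bytes."""
    # Number of samples kept for each time series over the retention period.
    samples_per_series = retention_minutes / scrape_interval_minutes
    return exporters * metrics_per_exporter * sample_bytes * samples_per_series


def to_gib(size_bytes):
    """Convert bytes to gibibytes (2**30 bytes)."""
    return size_bytes / 2**30


# 80 nodes, 2500 metrics each, one year of retention, one-minute scrapes.
usage = estimate_disk_bytes(80, 2500, MINUTES_PER_YEAR)
print(f"{to_gib(usage):.2f} GiB")  # prints "127.36 GiB"
```

With the same inputs and a 30-day retention
(`estimate_disk_bytes(80, 2500, 30 * 24 * 60)`), the estimate drops to
roughly 10.5GiB.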
Back then, we configured Prometheus to keep only 30 days of samples,
but that proved to be insufficient for many cases, so it was raised to
one year in 2020, in [issue 31244](https://gitlab.torproject.org/tpo/tpa/team/-/issues/31244).
In the [retention section of TPA-RFC-33](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-33-monitoring#retention), there is a detailed
discussion on retention periods. We're considering [multi-year
retention periods](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40330) for the future.
## Queues
......@@ -2851,20 +2891,17 @@ details.
## Monitoring and metrics
Prometheus is, of course, all about monitoring and metrics. It
is the thing that monitors everything and keeps metrics over the long
term.
The server monitors itself for system-level metrics but also
application-specific metrics. There's a long-term plan for
high-availability in [TPA-RFC-33-C][].
[TPA-RFC-33-C]: https://gitlab.torproject.org/groups/tpo/tpa/-/milestones/15
Metrics are held for about a year or less, depending on the server,
see [ticket 29388][] for storage requirements and possible
alternatives for data retention policies.
Note that [TPA-RFC-33][] also discusses alternative metrics retention
policies.
[TPA-RFC-33]: policy/tpa-rfc-33-monitoring
See also [storage](#storage) for retention policies.
## Tests
......
......