Verified Commit 7092d145 authored by anarcat

prometheus: expand the metrics and storage section

I was trying to show how much disk space Prometheus was using and
couldn't find good answers.
parent bc7996b4
@@ -46,6 +46,7 @@ follow the [training course](#training-course-plan) or see the [web dashboards s
- Ensuring the tags required for routing are there
- Link to prom graphs from prom's alert page
[TPA-RFC-33]: policy/tpa-rfc-33-monitoring
[Alert debugging]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#alert-debugging
[Queries cheat sheet]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#queries-cheat-sheet
[Adding alerts]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#writing-an-alert
@@ -2176,7 +2177,7 @@ Much of the initial Prometheus configuration was also documented in
storage requirements and possible alternatives for data retention
policies.
[ticket 29388]: https://bugs.torproject.org/29388
[ticket 29388]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/29388
[ticket 29681]: https://bugs.torproject.org/29681
[use of Debian packages for installation]: https://github.com/voxpupuli/puppet-prometheus/pull/303
[allow scrape job collection]: https://github.com/voxpupuli/puppet-prometheus/pull/304
@@ -2660,7 +2661,46 @@ There's also a [list of third-party exporters][] in the Prometheus documentation
## Storage
<!-- TODO databases? plain text file? the frigging blockchain? memory? -->
<a name="long-term-metrics-storage" />
Prometheus stores data in its own custom "time-series database"
(TSDB).
Metrics are held for about a year or less, depending on the
server. Look at [this dashboard for current disk usage of the
Prometheus servers](https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&from=now-1y&to=now&timezone=utc&var-class=$__all&var-instance=hetzner-nbg1-01.torproject.org&var-instance=hetzner-nbg1-02.torproject.org&var-Filters&refresh=auto).
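Alternatively, the server's own TSDB metrics can answer the same question. Here is a minimal sketch, assuming the standard Prometheus HTTP API is reachable and using the built-in `prometheus_tsdb_storage_blocks_bytes` self-metric; the server URL is a placeholder to adjust:

```python
# Sketch: ask a Prometheus server how much disk its TSDB blocks occupy.
# Assumes the standard /api/v1/query endpoint is reachable; the URL below
# is a placeholder, not the actual TPA server address.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://localhost:9090"  # placeholder, point at the real server
params = urllib.parse.urlencode({"query": "prometheus_tsdb_storage_blocks_bytes"})

with urllib.request.urlopen(f"{PROMETHEUS}/api/v1/query?{params}") as resp:
    data = json.load(resp)

for result in data["data"]["result"]:
    # each value is a [timestamp, "value-as-string"] pair; report in GiB
    size = float(result["value"][1])
    print(f"{result['metric'].get('instance', '?')}: {size / 2**30:.1f} GiB")
```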
The actual disk usage depends on:
- `N`: the number of exporters
- `X`: the number of metrics they expose
- 1.3 bytes: the size of a sample
- `P`: the retention period (currently 1 year)
- `I`: scrape interval (currently one minute)
The formula to compute disk usage is this:
```
N x X x 1.3 bytes x P / I
```
For example, in [ticket 29388][], we computed that a simple node
exporter setup with 2500 metrics per node across 80 nodes ends up
using roughly 127GiB (about 137GB) of disk:
```
> 1.3byte/minute * year * 2500 * 80 to Gibyte
(1,3 * (byte / minute)) * year * 2500 * 80 = approx. 127,35799 gibibytes
```
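For convenience, here is the same arithmetic as a small Python sketch; nothing Prometheus-specific, just the formula above plugged with those numbers:

```python
# Reproduce the estimate above: 80 node exporters, 2500 metrics each,
# one sample per minute, 1.3 bytes per sample, kept for one year.
BYTES_PER_SAMPLE = 1.3
exporters = 80
metrics_per_exporter = 2500
scrape_interval_s = 60
retention_s = 365 * 24 * 3600  # one year

samples_per_series = retention_s / scrape_interval_s
total = exporters * metrics_per_exporter * samples_per_series * BYTES_PER_SAMPLE
print(f"{total / 2**30:.1f} GiB")  # ~127 GiB, matching the calculation above
```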
Back then, we configured Prometheus to keep only 30 days of samples,
but that proved to be insufficient for many cases, so it was raised to
one year in 2020, in [issue 31244](https://gitlab.torproject.org/tpo/tpa/team/-/issues/31244).
The [retention section of TPA-RFC-33](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-33-monitoring#retention) has a detailed
discussion of retention periods. We're considering [multi-year
retention periods](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40330) for the future.
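Because the formula is linear in the retention period, a rough, illustrative extrapolation of what multi-year retention would mean for the same hypothetical 80-node setup looks like this:

```python
# Rough extrapolation: disk usage grows linearly with the retention period.
one_year_gib = 127.4  # the one-year estimate computed above
for years in (1, 2, 5):
    print(f"{years} year(s) of retention: ~{one_year_gib * years:.0f} GiB")
```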
## Queues
@@ -2851,20 +2891,17 @@ details.
## Monitoring and metrics
Prometheus is, of course, all about monitoring and metrics. It
is the thing that monitors everything and keeps metrics over the long
term.
The server monitors itself, both for system-level metrics and for
application-specific metrics. There's a long-term plan for
high availability in [TPA-RFC-33-C][].
[TPA-RFC-33-C]: https://gitlab.torproject.org/groups/tpo/tpa/-/milestones/15
Metrics are held for about a year or less, depending on the server,
see [ticket 29388][] for storage requirements and possible
alternatives for data retention policies.
Note that [TPA-RFC-33][] also discusses alternative metrics retention
policies.
[TPA-RFC-33]: policy/tpa-rfc-33-monitoring
See also [storage](#storage) for retention policies.
## Tests