From 7092d145340ca40db455fb5d65292468d09db146 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?= <anarcat@debian.org>
Date: Fri, 17 Jan 2025 16:05:06 -0500
Subject: [PATCH] prometheus: expand the metrics and storage section

I was trying to show how much disk space Prometheus was using and
couldn't find good answers.
---
 service/prometheus.md | 57 +++++++++++++++++++++++++++++++++++--------
 1 file changed, 47 insertions(+), 10 deletions(-)

diff --git a/service/prometheus.md b/service/prometheus.md
index dc1855a3..858a9b9a 100644
--- a/service/prometheus.md
+++ b/service/prometheus.md
@@ -46,6 +46,7 @@ follow the [training course](#training-course-plan) or see the [web dashboards s
 - Ensuring the tags required for routing are there
 - Link to prom graphs from prom's alert page
+
 [TPA-RFC-33]: policy/tpa-rfc-33-monitoring
 [Alert debugging]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#alert-debugging
 [Queries cheat sheet]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#queries-cheat-sheet
 [Adding alerts]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/prometheus#writing-an-alert
@@ -2176,7 +2177,7 @@ Much of the initial Prometheus configuration was also documented in
 storage requirements and possible alternatives for data retention
 policies.
 
-[ticket 29388]: https://bugs.torproject.org/29388
+[ticket 29388]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/29388
 [ticket 29681]: https://bugs.torproject.org/29681
 [use of Debian packages for installation]: https://github.com/voxpupuli/puppet-prometheus/pull/303
 [allow scrape job collection]: https://github.com/voxpupuli/puppet-prometheus/pull/304
@@ -2660,7 +2661,46 @@ There's also a [list of third-party exporters][] in the Prometheus documentation
 
 ## Storage
 
-<!-- TODO databases? plain text file? the frigging blockchain? memory? -->
+<a name="long-term-metrics-storage" />
+
+Prometheus stores data in its own custom "time-series database"
+(TSDB).
+
+Metrics are held for about a year or less, depending on the
+server. Look at [this dashboard for current disk usage of the
+Prometheus servers](https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&from=now-1y&to=now&timezone=utc&var-class=$__all&var-instance=hetzner-nbg1-01.torproject.org&var-instance=hetzner-nbg1-02.torproject.org&var-Filters&refresh=auto).
+
+The actual disk usage depends on:
+
+- `N`: the number of exporters
+- `X`: the number of metrics they expose
+- 1.3 bytes: the size of a sample
+- `P`: the retention period (currently 1 year)
+- `I`: scrape interval (currently one minute)
+
+The formula to compute disk usage is this:
+
+```
+N x X x 1.3 bytes x P / I
+```
+
+For example, in [ticket 29388][], we compute that a simple node
+exporter setup with 2500 metrics, with 80 nodes, will end up with
+127GiB of disk usage:
+
+```
+> 1.3byte/minute * year * 2500 * 80 to Gibyte
+
+  (1,3 * (byte / minute)) * year * 2500 * 80 = approx. 127,35799 gibibytes
+```
+
+Back then, we configured Prometheus to keep only 30 days of samples,
+but that proved to be insufficient for many cases, so it was raised to
+one year in 2020, in [issue 31244](https://gitlab.torproject.org/tpo/tpa/team/-/issues/31244).
+
+In the [retention section of TPA-RFC-33](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-33-monitoring#retention), there is a detailed
+discussion on retention periods. We're considering [multi-year
+retention periods](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40330) for the future.
 
 ## Queues
 
@@ -2851,20 +2891,17 @@ details.
 
 ## Monitoring and metrics
 
+Prometheus is, of course, all about monitoring and metrics. It
+is the thing that monitors everything and keeps metrics over the long
+term.
+
 The server monitors itself for system-level metrics but also
 application-specific metrics.
 There's a long-term plan for high-availability in [TPA-RFC-33-C][].
 
 [TPA-RFC-33-C]: https://gitlab.torproject.org/groups/tpo/tpa/-/milestones/15
 
-Metrics are held for about a year or less, depending on the server,
-see [ticket 29388][] for storage requirements and possible
-alternatives for data retention policies.
-
-Note that [TPA-RFC-33][] also discusses alternative metrics retention
-policies.
-
-[TPA-RFC-33]: policy/tpa-rfc-33-monitoring
+See also [storage](#storage) for retention policies.
 
 ## Tests
 
-- 
GitLab
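
As a cross-check of the storage estimate in this patch, the formula `N x X x 1.3 bytes x P / I` can be sketched in Python. This is only an illustration: the function and parameter names are hypothetical (they come neither from the patch nor from Prometheus), and the year length of 365.25 days is an assumption chosen because it reproduces the calculator output quoted in the patch.

```python
"""Rough Prometheus disk usage estimate (illustrative sketch only)."""

BYTES_PER_SAMPLE = 1.3  # average sample size quoted in the patch
# Assumption: a year of 365.25 days, which matches the quoted calculator output.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def estimate_disk_usage_bytes(exporters, metrics_per_exporter,
                              retention_seconds, scrape_interval_seconds,
                              bytes_per_sample=BYTES_PER_SAMPLE):
    """Return N x X x bytes_per_sample x P / I, in bytes."""
    # P / I: how many samples each time series accumulates over the retention period.
    samples_per_series = retention_seconds / scrape_interval_seconds
    return exporters * metrics_per_exporter * bytes_per_sample * samples_per_series


if __name__ == "__main__":
    # Values from the patch: 80 nodes, 2500 metrics each, one year, one-minute scrapes.
    usage = estimate_disk_usage_bytes(80, 2500, SECONDS_PER_YEAR, 60)
    print(f"{usage / 1e9:.1f} GB, {usage / 2**30:.1f} GiB")
```

With the values from the patch, this gives roughly 136.7 GB, which is about 127.4 GiB: the same quantity expressed in decimal and binary units. Real usage will vary with compression and series churn, so treat the result as an order-of-magnitude estimate.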