Verified Commit ebb4fc90 authored by anarcat

long term prometheus metrics storage

parent 14d30eee
@@ -823,6 +823,61 @@ would still be able to deduce some activity patterns from the metrics
generated by Prometheus, and use it to leverage side-channel attacks,
which is why the external Prometheus server access is restricted.

### Long term metrics storage

Metrics are held for about a year or less, depending on the server;
see [ticket 29388][] for storage requirements and possible
alternatives for data retention policies.

Note that extra long-term data retention might be possible [using
the remote read functionality](https://www.robustperception.io/looking-beyond-retention), which enables the primary server to
transparently read metrics from a secondary, longer-term server,
keeping graphs working without having to change the data source, for
example.
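
As a sketch, remote read on the short-term (primary) server might
look like the following; the long-term server hostname is
hypothetical, but `remote_read` and `read_recent` are standard
Prometheus configuration:

```
# prometheus.yml on the short-term (primary) server -- a sketch,
# the long-term server name is hypothetical
remote_read:
  - url: "http://prometheus-longterm.example.org:9090/api/v1/read"
    # only consult the long-term server for data older than what
    # is stored locally
    read_recent: false
```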

That way you could have a short-term server which polls every minute
or even every 15 seconds but keeps (say) only 30 days of data, and a
long-term server which would poll the short-term server every (say)
5 minutes but keep (say) 5 years of metrics. But how much data would
that be?

The [last time we made an estimate, in May 2020](https://gitlab.torproject.org/tpo/tpa/team/-/issues/31244#note_2541965), we had the
following calculation for 1 minute polling interval over a year:
```
> 365d×1.3byte/(1min)×2000×78 to Gibyte
99,271238 gibibytes
```
At the time of writing (August 2021), that is still the configured
interval, and the disk usage roughly matches that (98GB used). This
implies that we could store about 5 years of metrics with a 5 minute
polling interval, using the same disk usage, obviously:
```
> 5*365d×1.3byte/(5min)×2000×78 to Gibyte
99,271238 gibibytes
```

... or 15 years with a 15 minute interval, and so on. As a rule of
thumb, if we multiply the scrape interval by some factor, we can
multiply the retention period by the same factor while keeping disk
usage constant.
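
The rule of thumb can be checked with a short calculation. This is a
sketch: the figures of 1.3 bytes per sample, 2000 series per target
and 78 targets come from the estimate above, and the function name is
ours:

```python
def storage_gib(retention_days, interval_seconds,
                bytes_per_sample=1.3, series_per_target=2000, targets=78):
    """Rough TSDB disk usage: one sample per series per scrape."""
    samples_per_series = retention_days * 86400 / interval_seconds
    total = samples_per_series * bytes_per_sample * series_per_target * targets
    return total / 2**30  # bytes to GiB

print(round(storage_gib(365, 60), 2))       # 1 year at 1 minute -> 99.27
print(round(storage_gib(5 * 365, 300), 2))  # 5 years at 5 minutes -> 99.27
print(round(storage_gib(30, 5), 2))         # 30 days at 5 seconds -> 97.91
```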

On the other hand, we might be able to increase granularity quite a
bit by lowering the retention to (say) 30 days and the polling
interval to 5 seconds, which would give us:
```
> 30d*1.3byte/(5 second)*2000*78 to Gibyte
97,911358 gibibytes
```

That might be a bit aggressive though: the default Prometheus
`scrape_interval` is 15 seconds, not 5 seconds... With the defaults
(15 seconds scrape interval, 30 days retention), we'd be at about
30GiB of disk usage, which makes for a quite reasonable and
easy-to-replicate primary server.
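
Concretely, those defaults could be made explicit as follows. This is
a sketch (the config path is illustrative), but `scrape_interval` and
`--storage.tsdb.retention.time` are standard Prometheus 2.x settings:

```
# prometheus.yml
global:
  scrape_interval: 15s

# command line
prometheus --config.file=/etc/prometheus/prometheus.yml \
           --storage.tsdb.retention.time=30d
```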

## Backups

Prometheus servers should be fully configured through Puppet and
......