... | ... | @@ -823,6 +823,61 @@ would still be able to deduce some activity patterns from the metrics |
|
|
generated by Prometheus, and use them to mount side-channel attacks,
which is why access to the external Prometheus server is restricted.
|
|
|
|
|
|
### Long term metrics storage
|
|
|
|
|
|
Metrics are held for about a year or less, depending on the server;
see [ticket 29388][] for storage requirements and possible
alternatives for data retention policies.
|
|
|
|
|
|
Note that extra long-term data retention might be possible [using
the remote read functionality](https://www.robustperception.io/looking-beyond-retention), which enables the primary server to
transparently read metrics from a secondary, longer-term server. Graphs
would then keep working without having to change the data source, for
example.
|
|
|
|
|
|
That way you could have a short-term server which keeps lots of
metrics and polls every minute or even every 15 seconds, but keeps (say)
only 30 days of data, and a long-term server which would poll the
short-term server every (say) 5 minutes but keep (say) 5 years of
metrics. But how much data would that be?
|
|
|
|
|
|
The [last time we made an estimate, in May 2020](https://gitlab.torproject.org/tpo/tpa/team/-/issues/31244#note_2541965), we had the
following calculation for a 1 minute polling interval over a year:
|
|
|
|
|
|
```
> 365d×1.3byte/(1min)×2000×78 to Gibyte
99,271238 gibibytes
```
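
The same arithmetic can be sketched in Python as a sanity check. This is only an illustration: the function name and parameter names are hypothetical, and reading the factors `2000` and `78` from the formula as "series per host" and "number of hosts" is an assumption, since the formula itself does not label them.

```python
GIB = 2**30  # bytes in one gibibyte


def estimate_gib(retention_days, scrape_interval_seconds,
                 bytes_per_sample=1.3, series_per_host=2000, hosts=78):
    """Estimate disk usage in GiB for a retention period and scrape interval.

    The defaults mirror the factors in the formula above; what 2000 and 78
    stand for is an assumption made for this sketch.
    """
    # Number of samples stored for a single time series over the retention.
    samples_per_series = retention_days * 86400 / scrape_interval_seconds
    total_bytes = samples_per_series * bytes_per_sample * series_per_host * hosts
    return total_bytes / GIB


if __name__ == "__main__":
    # One year of retention with a one minute scrape interval.
    print(f"{estimate_gib(365, 60):.2f} GiB")
```

Run with one year and a 60 second interval, this should land on the same figure of about 99.27 GiB.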
|
|
|
|
|
|
At the time of writing (August 2021), that is still the configured
interval, and the disk usage roughly matches the estimate (98GB used). This
implies that we could store about 5 years of metrics with a 5 minute
polling interval in the same disk space:
|
|
|
|
|
|
```
> 5*365d×1.3byte/(5min)×2000×78 to Gibyte
99,271238 gibibytes
```
|
|
|
|
|
|
... or 15 years with a 15 minute interval, and so on. As a rule of thumb, as long as
we multiply the scrape interval by some factor, we can multiply the retention period
by the same factor without using more disk space.
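
The rule of thumb can be checked by turning the formula around and asking how long a fixed disk budget lasts at a given scrape interval. This is a rough sketch only: `max_retention_days` is a hypothetical helper, and the 1.3 bytes per sample and 2000×78 series figures are simply carried over from the estimates above.

```python
GIB = 2**30  # bytes in one gibibyte


def max_retention_days(budget_gib, scrape_interval_seconds,
                       bytes_per_sample=1.3, series=2000 * 78):
    """Return how many days of metrics fit in budget_gib of disk space."""
    # Bytes written per day across all series at this scrape interval.
    bytes_per_day = 86400 / scrape_interval_seconds * bytes_per_sample * series
    return budget_gib * GIB / bytes_per_day


if __name__ == "__main__":
    budget = 99.271238  # GiB, the one-year estimate above
    for minutes in (1, 5, 15):
        days = max_retention_days(budget, minutes * 60)
        print(f"{minutes:>2} min interval: {days / 365:.1f} years")
```

With the same budget, the retention comes out at roughly 1, 5 and 15 years for 1, 5 and 15 minute intervals: retention scales linearly with the scrape interval.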
|
|
|
|
|
|
On the other hand, we might be able to increase granularity quite a
bit by lowering the retention to (say) 30 days with a 5 second polling
interval, which would give us:
|
|
|
|
|
|
```
> 30d*1.3byte/(5 second)*2000*78 to Gibyte
97,911358 gibibytes
```
|
|
|
|
|
|
That might be a bit aggressive though: the default Prometheus
`scrape_interval` is 15 seconds, not 5 seconds. With a
15 second scrape interval and 30 days of retention, we'd be at about
33GiB of disk usage, which makes for a quite reasonable and easy to
replicate primary server.
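
To get back to the earlier question of how much data a short-term plus long-term pair would hold, the two tiers can be added up with the same formula. This is a sketch under stated assumptions: `tier_gib` is a hypothetical helper, and it assumes the long-term server stores the same number of series as the short-term one, which may not hold if only a subset of metrics is forwarded.

```python
GIB = 2**30  # bytes in one gibibyte


def tier_gib(retention_days, scrape_interval_seconds,
             bytes_per_sample=1.3, series=2000 * 78):
    """Estimate disk usage in GiB for one server tier."""
    samples_per_series = retention_days * 86400 / scrape_interval_seconds
    return samples_per_series * bytes_per_sample * series / GIB


# Short-term tier: 30 days of retention, 15 second scrape interval.
short_term = tier_gib(30, 15)
# Long-term tier: 5 years of retention, 5 minute scrape interval
# (assumes the same series count as the short-term tier).
long_term = tier_gib(5 * 365, 300)

if __name__ == "__main__":
    print(f"short-term: {short_term:.1f} GiB")
    print(f"long-term:  {long_term:.1f} GiB")
    print(f"total:      {short_term + long_term:.1f} GiB")
```

Under those assumptions the short-term server needs roughly a third of the disk space of the long-term one, and the pair together stays in the low hundreds of gibibytes.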
|
|
|
|
|
|
## Backups
|
|
|
|
|
|
Prometheus servers should be fully configured through Puppet and
|
... | ... | |