prometheus: estimate time to recovery authored by lelutin
+ be a bit more precise on information held by alertmanager.
@@ -2404,12 +2404,18 @@ Puppet.
 Non-configuration data should be restored from backup, with
 `/var/lib/prometheus/` being sufficient to reconstruct history.

+The time to restore data depends on the data size and state of the network, but
+as a rough indication, on 2025-11-19 the dataset was 144 GB and the
+transfer took somewhere between 2.5 and 3 hours.
+
 If even backups are destroyed, history will be lost, but the server should still
 recover and start tracking new metrics.

-Note that neither Alertmanager nor Karma hold specific state data, so nothing
-needs to be taken out of backups for those and as long as prometheus is tracking
-metrics they should both be working as well.
+Note that Alertmanager holds information about the current alert silences in
+place. If those are lost, we can recreate silences as needed. Karma
+does not hold specific state data, so nothing needs to be taken out of backups
+for it. Also, as long as Prometheus is tracking metrics, both services should
+be working as well.

 # Reference
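
The restore-time figure in this change is a single data point, so it can help to turn it into an average transfer rate and rescale it for a different dataset size. The following is only a rough sketch and not part of the commit: it assumes the 144 figure is in gigabytes (decimal, 1 GB = 1000 MB), that throughput stays roughly constant, and the function names and the 200 GB example size are made up for illustration.

```python
def throughput_mb_per_s(size_gb: float, hours: float) -> float:
    """Average transfer rate in MB/s for size_gb gigabytes moved in `hours`."""
    return size_gb * 1000 / (hours * 3600)


def estimate_restore_hours(size_gb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move size_gb gigabytes at a steady rate_mb_per_s."""
    return size_gb * 1000 / rate_mb_per_s / 3600


# Data point from the change: 144 GB restored in 2.5 to 3 hours.
slow_rate = throughput_mb_per_s(144, 3.0)
fast_rate = throughput_mb_per_s(144, 2.5)

if __name__ == "__main__":
    print(f"observed rate: {slow_rate:.1f} to {fast_rate:.1f} MB/s")
    # Hypothetical future dataset size, used only to show the rescaling.
    future_gb = 200
    low = estimate_restore_hours(future_gb, fast_rate)
    high = estimate_restore_hours(future_gb, slow_rate)
    print(f"{future_gb} GB would take roughly {low:.1f} to {high:.1f} hours")
```

Since the estimate scales linearly with dataset size, it is only as good as the assumption that network conditions resemble those of 2025-11-19.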