in #40950, i have tried and failed to deploy barman as a replacement for our legacy database backup system, in the hope of reusing its exporter and not having to port our stuff to it.
now we need to figure out another way, in the short term.
we need to somehow replace or port dsa-check-backuppg to prometheus.
keep in mind that thing has side effects! from what i remember, it seems to rotate files as well, so maybe we want to split it up into something that rotates files and something else that exports metrics, to keep our sanity.
so i have started this. first off, i made sure we monitor the health of the postgres exporter everywhere. it wasn't healthy everywhere!
then i added at least one critical alert that will warn if the archiver is lagging or producing errors (based on those metrics), in prometheus-alerts@856893bc. turns out this actually flags metricsdb-01 as being without backups, nice! except that's deliberate, as decided in #41626, so i added an exclusion for it in prometheus-alerts@0603159e.
this doesn't cover the server side of things (nor does it cover rotations), but it's a good start. i don't think there's a way to cover base backups strictly by monitoring the postgresql server itself, as it considers backups to be external. there is information about in-progress backups, but that is volatile: no history of past backups is kept there.
next step is probably to make a small script that will reproduce parts of the nagios check, but as prometheus metrics.
we will likely need to keep tor-nagios-checks for this, if only for the rotation, however. i've added it to the checklist in #41671.
so it turns out this was relatively simple. i threw a check in the dsa-check-backuppg script directly, which outputs this:
```
# HELP tpa_backuppg_last_check_timestamp_seconds last time backups were checked
# TYPE tpa_backuppg_last_check_timestamp_seconds gauge
tpa_backuppg_last_check_timestamp_seconds 1727211781.433234
# HELP tpa_backuppg_error_count number of errors found in last check, 0 on success
# TYPE tpa_backuppg_error_count gauge
tpa_backuppg_error_count 0
```
the error count is the most important bit, and it's what the script relies on internally to declare success or failure in nagios. we alert on > 0. we also check the timestamp for freshness.
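for reference, here's a minimal sketch of what that textfile export can look like as a standalone script. this is not the actual dsa-check-backuppg patch: the collector directory, the `.prom` file name and the error counting below are assumptions, only the metric names and types match the output above.

```python
#!/usr/bin/python3
# minimal sketch: write backup-check results as prometheus metrics for the
# node_exporter textfile collector; path, file name and error counting are
# assumptions, not the real dsa-check-backuppg code
import os
import tempfile
import time

# assumption: stock Debian node_exporter textfile collector directory
TEXTFILE_DIR = "/var/lib/prometheus/node-exporter"
METRICS_FILE = os.path.join(TEXTFILE_DIR, "tpa_backuppg.prom")


def write_metrics(error_count):
    """Write the two gauges atomically so node_exporter never reads a partial file."""
    body = (
        "# HELP tpa_backuppg_last_check_timestamp_seconds last time backups were checked\n"
        "# TYPE tpa_backuppg_last_check_timestamp_seconds gauge\n"
        "tpa_backuppg_last_check_timestamp_seconds %f\n"
        "# HELP tpa_backuppg_error_count number of errors found in last check, 0 on success\n"
        "# TYPE tpa_backuppg_error_count gauge\n"
        "tpa_backuppg_error_count %d\n"
    ) % (time.time(), error_count)
    # write to a temp file in the same directory, then rename over the target
    fd, tmp = tempfile.mkstemp(dir=TEXTFILE_DIR, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(body)
        os.chmod(tmp, 0o644)  # mkstemp creates the file mode 0600
        os.rename(tmp, METRICS_FILE)
    except BaseException:
        os.unlink(tmp)
        raise


if __name__ == "__main__":
    errors = 0  # in the real check, this is the number of problems found
    write_metrics(errors)
    raise SystemExit(1 if errors else 0)
```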
so this is actually done, amazingly. we need to make sure we keep the cron job or port it to something else, but that's tracked in #41671. i've also added a checklist item in #40695 (closed) to make sure we keep those cron jobs.
this was still broken because the textfile collector directory was not writable by the tordonate user. this is now fixed: the directory is mode 1775 (sticky, group-writable) and group-owned by a new prometheus-textfile group that i added tordonate to.
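a quick way to double-check that setup from a shell, as a sketch: the mode and group below are the ones described above, but the directory path is an assumption.

```python
#!/usr/bin/python3
# sanity-check the textfile collector directory permissions described above:
# mode 1775 and group prometheus-textfile; the path is an assumption
import grp
import os
import stat

d = "/var/lib/prometheus/node-exporter"
st = os.stat(d)
mode = stat.S_IMODE(st.st_mode)
group = grp.getgrgid(st.st_gid).gr_name
print("mode: %o, group: %s" % (mode, group))
assert mode == 0o1775, "expected 1775, got %o" % mode
assert group == "prometheus-textfile", "unexpected group: %s" % group
```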
as a fun fact, both nagios and prometheus noticed that the checks weren't working anymore.
nagios was showing this eerie status:
17:01:01 <nsa> tor-nagios: [bungei] postgresql backups is WARNING: OK: no problems detected
the WARNING: OK is because the "OK: no problems detected" part comes from the host's NRPE check, which claims things are okay, but the script then fails to write the metrics and exits with code 1, which nagios maps to WARNING.
and prometheus noticed, much later:
20:03:42 -ALERTOR1:#tor-alerts- PgLegacyBackupsStale[node/warning] alert is firing, 1 alerts on bungei.torproject.org
that's because it relies on staleness and doesn't notice the job failing, which is interesting: one would have thought cron would yell... but this is running from NRPE, so it's nagios doing the yelling. presumably, when the daily expiry job eventually ran, it would have yelled as well...
the resolution was properly noted by both:
12:08:42 -ALERTOR1:#tor-alerts- PgLegacyBackupsStale[node/warning] alert is resolved, 1 alerts on bungei.torproject.org
12:15:50 <nsa> tor-nagios: [bungei] postgresql backups is OK: OK: no problems detected