Verified commit 0c370314 authored by anarcat

add missing backups stale pager playbook

parent 04f6d2ea
+44 −0
@@ -1550,6 +1550,50 @@ A few ideas:
 - consider bumping the `max_connections` setting (in
   `postgresql.conf`) if this is a long term trend

### Backups stale

The `PgLegacyBackupsStale` alert looks like this:

    PostgreSQL backup checks are stale on test.torproject.org

This means the backup checks have not run recently enough. As a
result, we might not have accurate information on our backups, and
this should be fixed.

Those metrics are exported by the
`/usr/lib/nagios/plugins/dsa-check-backuppg` script, which is called
from the `/etc/cron.d/tor-backup-postgres` cron job, deployed by
Puppet. It writes to
`/var/lib/prometheus/node-exporter/tpa_backuppg.prom`, which gets
collected by the node exporter "text file collector".

Check that `/var/lib/prometheus/node-exporter/tpa_backuppg.prom` has
a recent timestamp. For example, compare this:

    tpa_backuppg_last_check_timestamp_seconds 1728495929.425171

With the current timestamp:

```
root@bungei:~# date +%s
1728511576
```

In this case, the lag is about 15647 seconds, which is more than 4 hours.
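
The same comparison can be scripted. The following is a minimal
sketch, not part of the deployed tooling: the `backuppg_lag` function
name is made up for illustration, and it assumes the metric value is
the last field on the metric line.

```shell
# Print the lag, in seconds, between "now" and the last backup check.
# Hypothetical helper: the function name and arguments are illustrative.
#   $1: path to the .prom file (defaults to the path used on this page)
#   $2: current time as a Unix timestamp (defaults to `date +%s`)
backuppg_lag() {
    prom_file="${1:-/var/lib/prometheus/node-exporter/tpa_backuppg.prom}"
    now="${2:-$(date +%s)}"
    awk -v now="$now" '
        # Assumption: the sample value is the last field on the line.
        $1 ~ /^tpa_backuppg_last_check_timestamp_seconds/ {
            printf "%.0f\n", now - $NF
            found = 1
        }
        # Exit non-zero if the metric is missing from the file.
        END { exit !found }
    ' "$prom_file"
}
```

Called without arguments on the backup server, it reads the default
path and uses the current time. With the two values from the example
above, it should print 15647.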

You should be able to run the check by hand with the command:

```
sudo -u torbackup /usr/lib/nagios/plugins/dsa-check-backuppg -e
```

It shouldn't return any output, but it *should* update the above
timestamp.

In this case, the schedule in the cron job was incorrect. The problem
had gone unnoticed because Nagios was running the check through NRPE,
which hid it. The schedule was changed to run every 15 minutes instead.
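
For reference, a `cron.d` entry that runs the check every 15 minutes
could look like the line below. This is a hypothetical sketch: the
actual content of `/etc/cron.d/tor-backup-postgres` is managed by
Puppet and is not reproduced here. Only the 15-minute schedule, the
`torbackup` user and the command come from this page.

```
# Hypothetical entry, for illustration only; the real file is deployed by Puppet.
# min  hour dom mon dow user      command
*/15   *    *   *   *   torbackup /usr/lib/nagios/plugins/dsa-check-backuppg -e
```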

## Disaster recovery

If a PostgreSQL server is destroyed completely or in part, we need to