add missing backups stale pager playbook (0c370314) · Commits · The Tor Project / TPA / Wiki Replica

howto/postgresql.md

+44 −0

		@@ -1550,6 +1550,50 @@ A few ideas:
		- consider bumping the `max_connections` setting (in
		`postgresql.conf`) if this is a long term trend

		### Backups stale

		The `PgLegacyBackupsStale` alert looks like this:

		PostgreSQL backup checks are stale on test.torproject.org

		This means the backup checks have not run recently enough, so we
		might not have accurate information on the state of our backups. This
		should be fixed.

		Those metrics are exported by the
		`/usr/lib/nagios/plugins/dsa-check-backuppg` script, which is called
		from the `/etc/cron.d/tor-backup-postgres` cron job, deployed by
		Puppet. It writes to
		`/var/lib/prometheus/node-exporter/tpa_backuppg.prom`, which gets
		collected by the node exporter "textfile collector".

		Check that `/var/lib/prometheus/node-exporter/tpa_backuppg.prom` has
		a recent timestamp. For example, compare this:

		tpa_backuppg_last_check_timestamp_seconds 1728495929.425171

		With the current timestamp:

		```
		root@bungei:~# date +%s
		1728511576
		```

		In this case, we have a lag of 15647 seconds (about 4 hours and 20
		minutes), which is more than 4 hours.
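The comparison above can be scripted. The following is a minimal sketch, not part of the deployed tooling: the `backuppg_lag` function name is made up for illustration, and it assumes the metric is written without labels, as in the sample line shown earlier.

```shell
# Print how many seconds ago the backup check last ran.
# $1: path to the textfile collector file (defaults to the path used above)
# $2: current time as a Unix timestamp (defaults to "now")
# backuppg_lag is a hypothetical helper, not an existing command.
backuppg_lag() {
    prom_file="${1:-/var/lib/prometheus/node-exporter/tpa_backuppg.prom}"
    now="${2:-$(date +%s)}"
    awk -v now="$now" '
        # Assumes the metric has no labels, so it is the whole first field.
        $1 == "tpa_backuppg_last_check_timestamp_seconds" {
            # Drop the fractional part so the result matches `date +%s` math.
            printf "%d\n", now - int($2)
            found = 1
        }
        # Fail if the metric is missing from the file.
        END { exit !found }
    ' "$prom_file"
}
```

Run `backuppg_lag` without arguments on the affected host to get the current lag in seconds. It exits with a non-zero status if the metric is not in the file at all.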

		You should be able to run the check by hand with the command:

		```
		sudo -u torbackup /usr/lib/nagios/plugins/dsa-check-backuppg -e
		```

		It shouldn't return any output, but it should update the above
		timestamp.
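To confirm that the timestamp actually moved, you can read it before and after the run. This is only a sketch: `last_check` and `timestamp_advanced` are hypothetical helper names, and the metric is again assumed to have no labels.

```shell
# Read the last-check timestamp (whole seconds) from a textfile collector file.
# last_check and timestamp_advanced are hypothetical helpers for illustration.
last_check() {
    awk '
        $1 == "tpa_backuppg_last_check_timestamp_seconds" {
            printf "%.0f\n", int($2)
            found = 1
        }
        END { exit !found }
    ' "$1"
}

# Run a command and succeed only if the timestamp in the file moved forward.
# $1: path to the .prom file; remaining arguments: the command to run.
# Returns 2 if the command fails, 1 if the timestamp is missing or unchanged.
timestamp_advanced() {
    prom_file="$1"; shift
    before="$(last_check "$prom_file")" || before=0
    "$@" || return 2
    after="$(last_check "$prom_file")" || return 1
    [ "$after" -gt "$before" ]
}
```

For example:

```shell
timestamp_advanced /var/lib/prometheus/node-exporter/tpa_backuppg.prom \
    sudo -u torbackup /usr/lib/nagios/plugins/dsa-check-backuppg -e \
    && echo "timestamp updated"
```

Because the comparison uses whole seconds, two runs within the same second would be reported as unchanged. That does not matter when the file is stale by hours.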

		In this case, the schedule in the cron job was incorrect. The problem
		had gone unnoticed because Nagios was running the check through NRPE,
		which hid it. The schedule was changed to run every 15 minutes instead.

		## Disaster recovery

		If a PostgreSQL server is destroyed completely or in part, we need to