A few ideas:

- consider bumping the `max_connections` setting (in
  `postgresql.conf`) if this is a long-term trend

### Backups stale

The `PgLegacyBackupsStale` alert looks like this:

    PostgreSQL backup checks are stale on test.torproject.org

This means the backup checks have not run recently enough, so we may
not have accurate information on our backups; this should be fixed.

Those metrics are exported by the
`/usr/lib/nagios/plugins/dsa-check-backuppg` script, which is called
from the `/etc/cron.d/tor-backup-postgres` cron job, deployed by
Puppet. It writes to
`/var/lib/prometheus/node-exporter/tpa_backuppg.prom`, which gets
collected by the node exporter's "text file collector".

Check that `/var/lib/prometheus/node-exporter/tpa_backuppg.prom` has
the right timestamp, for example by comparing this:

    tpa_backuppg_last_check_timestamp_seconds 1728495929.425171

with the current timestamp:

```
root@bungei:~# date +%s
1728511576
```

In this case, we have a lag of 15647 seconds, which is more than 4
hours.
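To compute that lag in one step, a small sketch like this should work
(it simply subtracts the exported timestamp from the current time,
using the metric name and file path shown above):

```
last=$(awk '/^tpa_backuppg_last_check_timestamp_seconds/ { print int($2) }' \
    /var/lib/prometheus/node-exporter/tpa_backuppg.prom)
# presumably the alert fires when this exceeds roughly 4 hours
# (14400 seconds), going by the example above
echo "lag: $(( $(date +%s) - last )) seconds"
```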
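It can also be worth confirming that the node exporter is actually
picking up the file; assuming it listens on its default port (9100),
something like:

```
# should show the same metrics as the .prom file
curl -s http://localhost:9100/metrics | grep '^tpa_backuppg'
```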
You should be able to run the check by hand with the command:

```
sudo -u torbackup /usr/lib/nagios/plugins/dsa-check-backuppg -e
```

It shouldn't return any output, but it *should* update the above
timestamp.

In this case, the schedule in the cron job was incorrect. The problem
had gone unnoticed because Nagios was running the check through NRPE,
which hid the failure; the schedule was changed to run every 15
minutes instead.

## Disaster recovery

If a PostgreSQL server is destroyed completely or in part, we need to