diff --git a/howto/postgresql.md b/howto/postgresql.md
index 32287e21228cdeaee7ad8145a49b46708b428997..a6446d5316c909b9d5b72f4fa5d4dcb90b453800 100644
--- a/howto/postgresql.md
+++ b/howto/postgresql.md
@@ -1694,12 +1694,94 @@ can reproduce the issue, through the systemd unit. For example, a
 See the [Running a backup manually instructions](#running-a-backup-manually)
 for details.
 
+Note that this can also happen when a new server is provisioned or the
+backup schedules have been changed; see below for that special case.
+
+Note that the `pgbackrest_exporter` only pulls metrics from pgBackRest
+once per `--collect.interval`, which defaults to 600 seconds (10
+minutes), so it might take unexpectedly long for an alert to resolve.
+
+#### Rescheduling issues
+
 It's possible this alert gets raised soon after a server is first
 provisioned, because of weird corner cases in systemd's `OnCalendar`
 implementation, see [this comment](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40950#note_3142800),
 if that happens. In such cases, the install instructions should be
 tweaked to schedule those backups in the proper time.
 
+This situation can also occur when backup schedules change. In that
+case, the systemd units get recreated and that throws systemd off
+guard. The solution, in this case, is to manually examine the
+`list-timers` output and manually schedule backups to fill in the
+gaps.
+For example, given this output:
+
+```
+root@backup-storage-01:~# systemctl list-timers | grep -e NEXT -e pgbackrest
+NEXT LEFT LAST PASSED UNIT ACTIVATES
+Wed 2025-01-29 17:54:57 UTC 1 day 21h left Wed 2025-01-22 17:55:01 UTC 5 days ago pgbackrest-backup-diff@bacula-director-01.timer pgbackrest-backup-diff@bacula-director-01.service
+Wed 2025-01-29 20:24:06 UTC 1 day 23h left Wed 2025-01-22 20:24:18 UTC 5 days ago pgbackrest-backup-diff@polyanthum.timer pgbackrest-backup-diff@polyanthum.service
+Fri 2025-01-31 16:23:45 UTC 3 days left Mon 2025-01-20 14:56:18 UTC 1 week 0 days ago pgbackrest-backup-diff@rude.timer pgbackrest-backup-diff@rude.service
+Sun 2025-02-02 00:18:55 UTC 5 days left Tue 2025-01-21 22:52:05 UTC 5 days ago pgbackrest-backup-diff@weather-01.timer pgbackrest-backup-diff@weather-01.service
+Sun 2025-02-02 15:30:00 UTC 5 days left Thu 2025-01-02 15:30:01 UTC 3 weeks 4 days ago pgbackrest-backup-full@polyanthum.timer pgbackrest-backup-full@polyanthum.service
+Mon 2025-02-03 09:45:41 UTC 6 days left Mon 2025-01-27 09:45:44 UTC 10h ago pgbackrest-backup-diff@materculae.timer pgbackrest-backup-diff@materculae.service
+Mon 2025-02-03 17:20:17 UTC 6 days left Mon 2025-01-27 17:20:18 UTC 3h 10min ago pgbackrest-backup-diff@meronense.timer pgbackrest-backup-diff@meronense.service
+Thu 2025-02-06 03:35:54 UTC 1 week 2 days left Mon 2025-01-06 03:36:05 UTC 3 weeks 0 days ago pgbackrest-backup-full@materculae.timer pgbackrest-backup-full@materculae.service
+Sat 2025-02-08 05:35:01 UTC 1 week 4 days left Wed 2025-01-08 05:35:05 UTC 2 weeks 5 days ago pgbackrest-backup-full@bacula-director-01.timer pgbackrest-backup-full@bacula-director-01.service
+Fri 2025-02-14 03:00:47 UTC 2 weeks 3 days left Tue 2025-01-14 03:00:50 UTC 1 week 6 days ago pgbackrest-backup-full@rude.timer pgbackrest-backup-full@rude.service
+Mon 2025-02-17 18:38:18 UTC 2 weeks 6 days left Tue 2025-01-14 23:49:54 UTC 1 week 5 days ago pgbackrest-backup-full@weather-01.timer pgbackrest-backup-full@weather-01.service
+Wed 2025-02-26 14:05:39 UTC 4 weeks 1 day left Wed 2025-01-15 03:07:09 UTC 1 week 5 days ago pgbackrest-backup-full@meronense.timer pgbackrest-backup-full@meronense.service
+```
+
+We can see a few problems:
+
+ 1. `pgbackrest-backup-diff@rude.timer` shows `PASSED` as a week ago,
+    but still has 3 days left
+
+ 2. `pgbackrest-backup-diff@weather-01.timer` shows `PASSED` as 5 days
+    ago but still has 5 days left
+
+ 3. (not seen) the `weather-01` full backup is too old (even though
+    the timer shows `PASSED` as 1 week 5 days ago)
+
+The solution, given the above, was:
+
+```
+systemctl start pgbackrest-backup-diff@rude.service
+systemctl start pgbackrest-backup-full@weather-01.service
+systemctl start pgbackrest-backup-diff@weather-01.service
+```
+
+But that was not complete. The `weather-01` case hints at the
+fundamental problem: the scheduler (systemd) is decoupled from the
+actual source of truth of the most recent backups (pgBackRest). So
+even though the timer "thinks" it ran (say) "1 week 5 days ago", in
+reality, that's when the timer *started*, not when the actual job
+ran.
+
+In the case of `pgbackrest-backup-full@meronense.service`, for
+example, the actual backup set is:
+
+```
+root@backup-storage-01:~# for stanza in meronense.torproject.org; do hostname=$(basename $stanza .torproject.org); echo $hostname; sudo -u pgbackrest-$hostname pgbackrest --stanza=$stanza info | grep backup: | grep -v incr; done
+meronense
+    full backup: 20241227-213304F
+    diff backup: 20241227-213304F_20241230-172036D
+    diff backup: 20241227-213304F_20250106-172049D
+    diff backup: 20241227-213304F_20250113-172051D
+    diff backup: 20241227-213304F_20250120-172023D
+    diff backup: 20241227-213304F_20250127-172022D
+```
+
+i.e. the last full backup is 20241227, which is, at the time of writing
+(2025-01-27), a full month ago, not "1 week 5 days ago". So we also
+need to look out for those: a full backup was scheduled for meronense
+to work around that issue.
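The comparison above (label date versus today's date) can be scripted rather than eyeballed. What follows is a minimal sketch, assuming GNU `date(1)` and backup labels of the form shown in the `info` output above. The `label_age_days` helper is hypothetical and not part of pgBackRest, and it only looks at the date embedded in the label, ignoring the time of day and the repository host's time zone.

```shell
# Hypothetical helper: print the age, in whole days, of a pgBackRest backup
# label relative to a reference date (defaults to today, UTC).
# Assumes GNU date(1) and labels such as 20241227-213304F (full) or
# 20241227-213304F_20250127-172022D (diff/incr).
label_age_days() {
    label=$1
    ref=${2:-$(date -u +%Y-%m-%d)}
    # For diff/incr labels, the backup's own timestamp follows the "_";
    # a full label has no "_" and is kept whole.
    stamp=${label##*_}
    # Keep only the YYYYMMDD part and rewrite it as YYYY-MM-DD.
    day=$(printf '%s\n' "${stamp%%-*}" | sed 's/^\(....\)\(..\)\(..\)$/\1-\2-\3/')
    then_s=$(date -u -d "$day" +%s)
    now_s=$(date -u -d "$ref" +%s)
    echo $(( (now_s - then_s) / 86400 ))
}

# The meronense full backup above, as seen on 2025-01-27:
label_age_days 20241227-213304F 2025-01-27   # prints 31
```

A result well above the expected full-backup interval, as here, suggests the timer's `LAST` column should not be trusted for that stanza.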
+
+In general, a good rule of thumb when rescheduling backups is to look
+at how much time there is in the `LEFT` column. If that's longer than
+the expected interval, schedule a backup with `at(1)` or `systemd-run
+--on-calendar` with a time set to fix the discrepancy.
+
 #### Backups stale (legacy)
 
 <a name="backups-stale" />