TPA-RFC-65: replace our bespoke postgresql backup system
I have just found out about Barman, a PostgreSQL backup system which is pretty close to the bespoke system we're using at TPA, except it's actively developed, commercially supported, packaged in Debian, and generally pretty damn solid.
Consider replacing our tool with this. I'm not sure what process we should use for this, but I would probably need to set up a must have / nice to have / non-goal spec, and, yes, another damn RFC.
For now, I've just documented various tools I found yesterday searching around the interweb in the wiki here:
https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/postgresql#backup-systems
There was a proposal made for this, see: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-65-postgresql-backups
excerpt:
Phase I: alpha testing
Migrate the following backups from bungei to backup-storage-01:
- weather-01 (12.7GiB)
- rude (35.1GiB)
- materculae (151.9GiB)

test restores! see checklist below

Phase II: beta testing
After a week, retire the above backups from bungei, then migrate the following servers:
- gitlab-02 (34.9GiB), to be migrated to gnt-dal
- polyanthum (20.3GiB)
- meronense (505.1GiB)
After another week, migrate the last backups from bungei:
- bacula-director-01 (180.8GiB)

At this point, we should hopefully have enough room on the backup server to survive the holidays.
Phase IV: retire legacy, bungei replacement
At this point, the only backups using the legacy system are the ones from the gnt-dal cluster (4 servers). Rebuild those with the new service. Do not keep a copy of the legacy system on bungei (to save space, particularly for metricsdb-01) but possibly archive a copy of the legacy backups on backup-storage-01:
- metricsdb-01 (1.6TiB)
- puppetdb-01 (20.2GiB)
- survey-01 (5.7GiB)
- anonticket-01 (3.9GiB)
- gitlab-02 (34.9GiB)

If we still run out of disk space on bungei, consider replacing the server entirely. The server is now 5 years old, which is getting close to our current amortization time (6 years), and it's a rental server, so it's relatively easy to replace, as we don't need to buy new hardware.
Basic setup (done)

- pgbarman testing and manual setup
- pgbarman puppetization, used deric-barman, see also this search for barman on the forge
- abort the barman test after too many failures, see #40950 (comment 3082333)
- evaluate https://pgbackrest.org/ as a replacement, check requirements
- find puppet module (attestra/pgbackrest exists, but too minimal)
- puppetize pgbackrest (TODOs remain in the pgbackrest module, but considered low priority)
- expiration (current policy is "21 days"; see the configuration sketch after this list)
- test restores (see the restore sketch after this list)
  - simple restore on existing server
  - bare bones restore without pgbackrest
- compare performance with legacy system, consider optimisations like parallel backups, incrementals, block dedupe, etc.
- monitoring (see the alerting sketch after this list)
  - prometheus exporter: exporter only in trixie, missing service file: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1087805
  - grafana dashboard: https://grafana.torproject.org/d/Wyf8STx7z/pgbackrest-exporter-dashboard
  - alerting: prometheus-alerts!62 (merged)
- progressive deployment on all servers, phases I, II, III (see checklist above)
- cleanup `.old` archives on bungei (scheduled in at(1), next jobs to be scheduled by hand)
- improve rotation schedules: add daily `incr` backups, move `diff` to weekly and `full` to monthly, moving retention to 30 days
- documentation overhaul, add pgbackrest details to:
  - pager playbook
  - running a full backup
  - basic restore
  - disaster recovery
  - monitoring
  - review the backups.md file as well
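For the expiration and rotation items above, here is a minimal sketch of what the pgbackrest side could look like; the stanza name, paths and cron times are made up for illustration, and the real settings live in Puppet:

```ini
# /etc/pgbackrest.conf (sketch) -- time-based expiration: keep full backups,
# and the diff/incr backups that depend on them, for 30 days
[global]
repo1-path=/srv/pgbackrest
repo1-retention-full-type=time
repo1-retention-full=30
# parallel compression/transfer, relevant to the performance comparison item
process-max=4

[main]
pg1-path=/var/lib/postgresql/15/main
```

The "daily incr, weekly diff, monthly full" rotation would then be something like this, in the crontab of whatever user runs pgbackrest on the backup host (times are illustrative):

```
30 1 * * *  pgbackrest --stanza=main --type=incr backup
30 2 * * 0  pgbackrest --stanza=main --type=diff backup
30 3 1 * *  pgbackrest --stanza=main --type=full backup
```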
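For the restore tests, the basic shape of a simple in-place restore on an existing server, assuming a stanza named `main` and a Debian `postgresql@15-main` cluster (both just examples here):

```sh
# list available backups for the stanza
sudo -u postgres pgbackrest --stanza=main info

# stop the cluster, restore the latest backup in place, start it again;
# --delta only rewrites files that differ, much faster than a full copy
sudo systemctl stop postgresql@15-main
sudo -u postgres pgbackrest --stanza=main --delta restore
sudo systemctl start postgresql@15-main
```

The "bare bones" variant is the same idea done by hand, pulling the backup files and WAL out of the repository without relying on the pgbackrest binary.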
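For the alerting item, the actual rules are in prometheus-alerts!62; just to give an idea of the shape, here is a hedged sketch of a backup-freshness alert, assuming the `pgbackrest_backup_since_last_completion_seconds` metric from pgbackrest_exporter (verify the exact metric and label names against what the packaged exporter actually exposes):

```yaml
groups:
  - name: pgbackrest
    rules:
      - alert: PgBackRestBackupTooOld
        # fire if no backup of any type completed in the last two days;
        # metric and label names are assumptions, check the exporter's output
        expr: min by (stanza) (pgbackrest_backup_since_last_completion_seconds) > 2 * 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "pgbackrest backups for stanza {{ $labels.stanza }} are more than 2 days old"
```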
Legacy cleanup (postponed to 2025)

- replacement of current pg backups on bungei (phase IV above)
- legacy code cleanup (includes tor-puppet and might allow for removal of tor-nagios-checks on more servers, see also #41671, might need a "power grep")
- remove or replace references to legacy system in docs, review the entire document
- give the whole postgresql docs a read, in particular reference and discussion
  - "direct backup recovery" (archive)
  - "indirect backup recovery" (archive)
- bungei replacement or resizing (see also #41364 (closed))
- fancy restore test extras
  - restore on new server (not barebones, requires provisioning a server with Puppet and LDAP, so annoying)
  - PITR restores (AKA "go back in time"; see the sketch after this list)
- if time permits, finish TODO items in pgbackrest module
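For the PITR item, a sketch of what "going back in time" looks like with pgbackrest, with a made-up stanza name, cluster name and target timestamp:

```sh
# restore the cluster and replay WAL only up to the given point in time,
# then promote instead of continuing recovery to the end of the WAL
sudo systemctl stop postgresql@15-main
sudo -u postgres pgbackrest --stanza=main --delta \
    --type=time --target="2025-06-01 03:00:00+00" \
    --target-action=promote restore
sudo systemctl start postgresql@15-main
```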
Note, regarding the requirements (https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-65-postgresql-backups#goals), that one requirement was left unwritten there: the backup system must have similar properties to the current one, namely that the backup server pulls base backups, with minimal privileges, from the database server.
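pgbackrest can work that way: when the stanza is configured on the backup (repository) host with `pg1-host` pointing at the database server, the backup command runs on the backup host and pulls the base backup over SSH, rather than the database server pushing into the repository. A minimal sketch, with host names and paths made up for illustration:

```ini
# /etc/pgbackrest.conf on the backup host (sketch only)
[global]
repo1-path=/srv/pgbackrest

[meronense]
pg1-host=meronense.torproject.org
pg1-path=/var/lib/postgresql/15/main
```

Backups would then be started from the backup host (`pgbackrest --stanza=meronense backup`), which keeps the same pull model and similarly minimal privileges as the legacy system.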