TPA-RFC-65: replace our bespoke postgresql backup system

I have just found out about barman, a PostgreSQL backup system which is pretty close to the bespoke system we're using at TPA, except it's actively developed, commercially supported, packaged in Debian, and generally pretty damn solid.

Consider replacing our tool with this. Not sure what process we should use for this, but I would probably need to set up a must-have/nice-to-have/non-goal spec, and, yes, another damn RFC.
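
For illustration, here is a minimal sketch of what a barman server definition looks like (the host name, user names and retention window are assumptions, not our actual config); the backup server connects out to the database host and pulls base backups, much like our current setup:

```
# /etc/barman.d/weather-01.conf (sketch)
[weather-01]
description = "weather-01 PostgreSQL cluster"
# regular libpq connection, used for management queries
conninfo = host=weather-01.torproject.org user=barman dbname=postgres
# take base backups with pg_basebackup over the streaming connection
backup_method = postgres
# replication connection used for pg_basebackup and WAL streaming
streaming_conninfo = host=weather-01.torproject.org user=streaming_barman
streaming_archiver = on
slot_name = barman
retention_policy = RECOVERY WINDOW OF 21 DAYS
```

`barman check weather-01` then verifies the setup, and `barman backup weather-01` takes a base backup from the backup server side.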

For now, I've just documented in the wiki the various tools I found yesterday while searching around the interweb:

https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/postgresql#backup-systems

There was a proposal made for this, see: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-65-postgresql-backups

excerpt:

Phase I: alpha testing

Migrate the following backups from bungei to backup-storage-01:

  • weather-01 (12.7GiB)
  • rude (35.1GiB)
  • materculae (151.9GiB)
  • test restores! see checklist below
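
For illustration, bootstrapping one of these servers on backup-storage-01 could look roughly like this (a sketch only; the stanza name matches the host name by assumption, and the commands run as the backup user on the repository host):

```
# create the repository layout for this server, then verify that
# archiving and the repository are wired up correctly end to end
pgbackrest --stanza=weather-01 stanza-create
pgbackrest --stanza=weather-01 check

# take the initial full base backup, pulled from the database host
pgbackrest --stanza=weather-01 --type=full backup

# list backups and archived WAL to confirm the migration worked
pgbackrest --stanza=weather-01 info
```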

Phase II: beta testing

After a week, retire the above backups from bungei, then migrate the following servers:

  • gitlab-02 (34.9GiB) to be migrated to gnt-dal
  • polyanthum (20.3GiB)
  • meronense (505.1GiB)

Phase III: production

After another week, migrate the last backups from bungei:

  • bacula-director-01 (180.8GiB)
  • lists-01

At this point, we should hopefully have enough room on the backup server to survive the holidays.

Phase IV: retire legacy, bungei replacement

At this point, the only backups using the legacy system are the ones from the gnt-dal cluster (4 servers). Rebuild those with the new service. Do not keep a copy of the legacy system on bungei (to save space, particularly for metricsdb-01), but possibly archive a copy of the legacy backups on backup-storage-01:

  • metricsdb-01 (1.6TiB) (backups disabled)
  • puppetdb-01 (20.2GiB)
  • survey-01 (5.7GiB)
  • anonticket-01 (3.9GiB)
  • gitlab-02 (34.9GiB)

If we still run out of disk space on bungei, consider replacing the server entirely. The server is now 5 years old, which is getting close to our current amortization time (6 years), and it's a rental server, so it's relatively easy to replace, as we don't need to buy new hardware.

Basic setup (done)

  • pgbarman testing and manual setup
  • pgbarman puppetization, consider using deric-barman, see also this search for barman on the forge
  • abort the barman test after too many failures, see #40950 (comment 3082333)
  • evaluate https://pgbackrest.org/ as a replacement, check requirements
  • find puppet module (attestra/pgbackrest exists, but too minimal)
  • puppetize pgbackrest (TODOs remain in the pgbackrest module, but considered low priority)
  • expiration (current policy is "21 days")
  • test restores
    • simple restore on existing server (sketched after this list)
    • bare bones restore without pgbackrest
  • compare performance with legacy system, consider optimisations like parallel backups, incrementals, block dedupe, etc
  • monitoring
  • progressive deployment on all servers, phases I, II, III (see checklist above)
  • cleanup .old archives on bungei (scheduled in at(1), next jobs to be scheduled by hand)
  • improve rotation schedules: add daily incr backups, move diff to weekly and full to monthly, and move retention to 30 days (see the cron sketch after this list)
  • documentation overhaul, add pgbackrest details to:
    • pager playbook
    • running a full backup
    • basic restore
    • disaster recovery
    • monitoring
    • review the backups.md file as well
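
For the "simple restore on existing server" item above, a hedged sketch of the procedure, assuming a stanza named weather-01 and a Debian postgresql@15-main cluster (both placeholders):

```
# on the database host, stop the cluster that is being restored
systemctl stop postgresql@15-main

# rewrite the data directory from the latest backup; --delta only
# replaces files that differ instead of wiping the directory first
pgbackrest --stanza=weather-01 --delta restore

# pgbackrest drops a recovery.signal and a restore_command into the
# cluster configuration, so starting it replays WAL to the end of
# the archive and then resumes normal operation
systemctl start postgresql@15-main
```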
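
For the rotation schedule above, a sketch of what the cron entries could look like on the repository host (the times, the user, and the stanza name are all assumptions, and overlaps on the first of the month are glossed over):

```
# /etc/cron.d/pgbackrest-weather-01 (sketch)
# monthly full, weekly diff, daily incr on the remaining days
30 2 1 * *   pgbackrest  pgbackrest --stanza=weather-01 --type=full backup
30 2 * * 0   pgbackrest  pgbackrest --stanza=weather-01 --type=diff backup
30 2 * * 1-6 pgbackrest  pgbackrest --stanza=weather-01 --type=incr backup
```

The 30-day retention could then be expressed with `repo1-retention-full-type = time` and `repo1-retention-full = 30` in the `[global]` section, assuming time-based retention fits the policy.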

Legacy cleanup

  • switch default from legacy to pgbackrest (lists-01 was installed with legacy instead of pgbackrest)
  • replacement of current pg backups on bungei (phase IV above)
  • legacy code cleanup (includes tor-puppet and might allow for removal of tor-nagios-checks on more servers, see also #41671, might need a "power grep")
  • remove or replace references to legacy system in docs, review the entire document
    • give the whole postgresql docs a read, in particular reference and discussion
    • "direct backup recovery" (archive)
    • "indirect backup recovery" (archive)
  • bungei replacement or resizing (see also #41364 (closed))

Stretch goals

  • fancy restore test extras
    • restore on new server (not barebones, requires provisioning a server with Puppet and LDAP, so annoying)
    • PITR restores (AKA "go back in time"; sketched after this list)
  • if time permits, finish TODO items in pgbackrest module
  • consider TLS mode, after repeated comments on the pgbackrest module PR
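
For the PITR stretch goal, a hedged sketch of what "going back in time" looks like with pgbackrest (the stanza name and timestamp are placeholders):

```
# with the cluster stopped, restore to a point in time instead of
# to the end of the WAL archive, then promote once the target is hit
pgbackrest --stanza=weather-01 --delta \
  --type=time "--target=2025-06-01 12:00:00+00" \
  --target-action=promote restore
```

Starting the cluster afterwards replays WAL up to the target and promotes.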

Note that among the requirements (https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-65-postgresql-backups#goals) there is one that was left unwritten: the backup system must have similar properties to the current one, namely that the backup server pulls base backups, with minimal privileges, from the database server.
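
To make that explicit, a sketch of how the pull model maps onto pgbackrest (host names, users and paths are assumptions): the repository host defines the database host and connects out to it, so the base backup is pulled from the backup server side rather than pushed by the database.

```
# /etc/pgbackrest.conf on the backup server (repository host); all
# paths, host names and the stanza name are assumptions
[global]
repo1-path = /srv/pgbackrest

[weather-01]
# the database server, reached over SSH with an unprivileged account
pg1-host = weather-01.torproject.org
pg1-host-user = postgres
pg1-path = /var/lib/postgresql/15/main
```

Running `pgbackrest --stanza=weather-01 backup` on the backup server then pulls the base backup over that connection; the database host only ships WAL segments back with `archive_command = 'pgbackrest --stanza=weather-01 archive-push %p'` and a `repo1-host` pointing at the backup server on its side.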
