TPA-RFC-65: replace our bespoke PostgreSQL backup system
I have just found out about barman, a PostgreSQL backup system which is pretty close to the bespoke system we're using at TPA, except it's actively developed, commercially supported, packaged in Debian, and generally pretty damn solid.
Consider replacing our tool with this. I'm not sure what process we should use for this, but I would probably need to set up a must-have/nice-to-have/non-goal spec, and, yes, another damn RFC.
For now, I've just documented the various tools I found yesterday while searching around the interweb, in the wiki here:
https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/postgresql#backup-systems
There was a proposal made for this, see: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-65-postgresql-backups
excerpt:
Phase I: alpha testing
Migrate the following backups from bungei to backup-storage-01:
- weather-01 (12.7GiB)
- rude (35.1GiB)
- materculae (151.9GiB)

Phase II: beta testing
After a week, retire the above backups from bungei, then migrate the following servers:
- gitlab-02 (34.9GiB)
- polyanthum (20.3GiB)
- meronense (505.1GiB)

Phase III: production
After another week, migrate the last backups from bungei:
- bacula-director-01 (180.8GiB)

At this point, we should hopefully have enough room on the backup server to survive the holidays.
Phase IV: retire legacy, bungei replacement
At this point, the only backups using the legacy system are the ones from the gnt-dal cluster (4 servers). Rebuild those with the new service. Do not keep a copy of the legacy system on bungei (to save space, particularly for metricsdb-01) but possibly archive a copy of the legacy backups on backup-storage-01:
- metricsdb-01 (1.6TiB)
- puppetdb-01 (20.2GiB)
- survey-01 (5.7GiB)
- anonticket-01 (3.9GiB)

If we still run out of disk space on bungei, consider replacing the server entirely. The server is now 5 years old, which is getting close to our current amortization time (6 years), and it's a rental server, so it's relatively easy to replace, as we don't need to buy new hardware.
Next steps:
- pgbarman testing and manual setup (see the configuration sketch after this list)
- pgbarman puppetization, consider using deric-barman, see also this search for barman on the forge
- abort the barman test after too many failures, see #40950 (comment 3082333)
- evaluate https://pgbackrest.org/ as a replacement, check requirements
- find puppet module (attestra/pgbackrest exists, but too minimal)
- puppetize pgbackrest (see local TODO)
- test and deploy exporter
- progressive deployment on all servers (see checklist above)
- legacy code cleanup (see also #41671)
- bungei replacement or resizing (see also #41364 (closed))
- documentation updates
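For the manual barman test, the server-side configuration could look roughly like the sketch below. This is only a sketch: the server name `main-db`, the hostname, and the `barman`/`streaming_barman` connection users are placeholders following the barman documentation's conventions, not our actual setup.

```
# /etc/barman.d/main-db.conf on the backup server
# sketch only: server name, host and users are placeholders
[main-db]
description = "main-db PostgreSQL server (barman test)"
conninfo = host=main-db.torproject.org user=barman dbname=postgres
streaming_conninfo = host=main-db.torproject.org user=streaming_barman
backup_method = postgres
streaming_archiver = on
slot_name = barman
retention_policy = RECOVERY WINDOW OF 2 WEEKS
```

With something like that in place, `barman check main-db` followed by `barman backup main-db` on the backup host should be enough to validate a pull-style base backup, and `barman list-backup main-db` shows what was taken.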
Note, regarding the requirements (https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-65-postgresql-backups#goals), that one requirement was left unwritten there: the backup system must have similar properties to the current one, namely that the backup server pulls base backups, with minimal privileges, from the database server.
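For the pgbackrest evaluation, that pull direction maps onto pgBackRest's repository-host mode, roughly as sketched below. Again a sketch, not our configuration: the stanza name `main`, the hostname, the repository path and the PostgreSQL data directory are placeholders, and how far privileges can actually be restricted on the database side is part of what needs to be checked against the requirements.

```
# /etc/pgbackrest.conf on the backup (repository) server
# sketch only: stanza name, host and paths are placeholders
[global]
repo1-path=/srv/pgbackrest
repo1-retention-full=2
start-fast=y

[main]
pg1-host=main-db.torproject.org
pg1-host-user=postgres
pg1-path=/var/lib/postgresql/15/main
```

Backups would then be initiated from the repository host with `pgbackrest --stanza=main stanza-create`, `pgbackrest --stanza=main check` and `pgbackrest --stanza=main backup`; the database server side would carry a matching client configuration pointing back at the repository host and an `archive_command` of `pgbackrest --stanza=main archive-push %p` for WAL archiving.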