rethink gitlab backup strategy
we are currently backing up everything in GitLab twice: once through Bacula, and another time through the gitlab-backup script. in #40517 (closed) we at least pulled artifacts out of this, but we should think real hard about whether or not we need the gitlab-backup script at all, because it duplicates things and wastes CPU cycles.
my preference would be to have rotating ZFS snapshots (because LVM snapshots would be too costly in performance) on this server: one snapshot every 10 minutes for the last 10 minutes, another every hour for the last 24h, and then Bacula backs up the latest available snapshot. that way bacula backups are consistent. we could even implement some flushing of the PostgreSQL database to ensure it's consistent as well.
this would completely remove the need for the gitlab-backup script, and would also mitigate gitlab#20 (moved) to a certain extent.
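to make the rotation idea concrete, here is a minimal sketch of the retention logic only, not of the ZFS or Bacula integration. it assumes snapshots are identified by their UTC creation time; the names `TIERS`, `snapshots_to_keep` and `latest_snapshot` are hypothetical, and the tier values just restate the schedule proposed above, so they can be tuned freely.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# (interval, window) pairs: keep one snapshot per `interval`
# for snapshots no older than `window`. Values mirror the proposal
# above and are assumptions, not a tested policy.
TIERS = [
    (timedelta(minutes=10), timedelta(minutes=10)),
    (timedelta(hours=1), timedelta(hours=24)),
]


def snapshots_to_keep(snapshots, now, tiers=TIERS):
    """Return the set of snapshot timestamps retained by any tier."""
    keep = set()
    for interval, window in tiers:
        seen_buckets = set()
        # Walk newest first so the newest snapshot in each bucket wins.
        for ts in sorted(snapshots, reverse=True):
            age = now - ts
            if age < timedelta(0) or age > window:
                continue
            # Bucket index: how many whole intervals since the epoch.
            bucket = (ts - EPOCH) // interval
            if bucket not in seen_buckets:
                seen_buckets.add(bucket)
                keep.add(ts)
    return keep


def latest_snapshot(snapshots):
    """The snapshot a file-level backup job would read from."""
    return max(snapshots)


# Example: one snapshot every 10 minutes over the past 3 hours.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
snapshots = [now - timedelta(minutes=10 * i) for i in range(19)]
keep = snapshots_to_keep(snapshots, now)
prune = set(snapshots) - keep
```

anything in `prune` would be destroyed by whatever tool manages the snapshots, and the backup job would read from `latest_snapshot(snapshots)` instead of the live filesystem.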
it's a significant re-engineering effort, however: it might be simpler to just implement gitlab#20 (moved) and use regular postgresql backups combined with bacula, and hope for the best in terms of consistency. we use GitLab so much though that I would really like to be able to easily go back in time in smaller chunks than what bacula offers.
the backups to review are:
- PostgreSQL databases: moved to our normal backup system (#41426 (closed))
- Git repositories: covered by bacula, risk of "corrupt" git repositories on disaster recovery (e.g. partial writes like "a ref was uploaded but not its blob" or "a part of a blob was uploaded"), see https://gitlab.com/gitlab-org/gitlab/-/issues/432743 for a discussion
- Blobs: currently on disk, assumed to be safe to backup by bacula, but also covered by the rake task, could be moved to object storage and rely on that for backups, those are:
  - uploads
  - builds
  - artifacts
  - pages
  - lfs
  - terraform states (!?)
  - packages
  - ci secure files
- Container registry: same, currently in object storage without backups
- Configuration files: backed up by bacula, assumed safe
- Other data: mainly redis for the job queue and elastic search for the advanced search, we don't use the latter and the former we could probably live without