we are currently backing up everything in GitLab twice: once through Bacula, and again through the gitlab-backup script. in #40517 (closed) we at least pulled artifacts out of this, but we should think hard about whether we need the gitlab-backup script at all, because it duplicates data and wastes CPU cycles.
my preference would be to have rotating ZFS snapshots on this server (LVM snapshots would be too costly in performance): one snapshot every ten minutes, kept for the last ten minutes; another every hour, kept for the last 24 hours; Bacula would then back up the latest available snapshot. that way bacula backups are consistent. we could even implement some flushing of the postgresql database to make sure it is consistent as well.
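as a rough illustration, here is a minimal sketch of such a rotation, assuming a hypothetical `tank/gitlab` dataset holding /srv and GNU coreutils on the box; a tool like zfs-auto-snapshot or sanoid would do this more robustly, this is just to show the idea:

```sh
#!/bin/sh
# rotate-gitlab-snapshots: minimal sketch, run from cron every 10 minutes.
# "tank/gitlab" is a made-up dataset name; GNU head/xargs are assumed.
set -eu
DATASET=tank/gitlab
now=$(date +%Y%m%d-%H%M)

# (a PostgreSQL CHECKPOINT could be forced here so the on-disk database state
# captured by the snapshot is as close to consistent as possible)

# take a "frequent" snapshot every run, plus an "hourly" one at the top of the hour
zfs snapshot "${DATASET}@frequent-${now}"
[ "$(date +%M)" = 00 ] && zfs snapshot "${DATASET}@hourly-${now}"

# prune: keep only the most recent "frequent" snapshot and 24 "hourly" ones
for rule in "frequent 1" "hourly 24"; do
    set -- $rule
    zfs list -H -d 1 -t snapshot -o name -s creation "$DATASET" \
        | grep "@$1-" | head -n -"$2" | xargs -r -n 1 zfs destroy
done
```

bacula could then be pointed at the hidden .zfs/snapshot/ directory (or at a clone of the latest snapshot) instead of the live filesystem, which is what would make those backups consistent.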
this would completely remove the need for the gitlab-backup script, and would also mitigate gitlab#20 (moved) to a certain extent.
it's a significant re-engineering effort, however: it might be simpler to just implement gitlab#20 (moved) and use regular postgresql backups combined with bacula, and hope for the best in terms of consistency. we use GitLab so much though that I would really like to be able to easily go back in time in smaller chunks than what bacula offers.
the backups to review are:
PostgreSQL databases: moved to our normal backup system (#41426 (closed))
Git repositories: covered by bacula, with a risk of "corrupt" git repositories on disaster recovery (e.g. partial writes like "a ref was uploaded but not its blob" or "part of a blob was uploaded"), see https://gitlab.com/gitlab-org/gitlab/-/issues/432743 for a discussion and the sketch after this list for a post-restore check
Blobs: currently on disk, assumed to be safe to back up with bacula, but also covered by the rake task; could be moved to object storage and rely on that for backups. those are:
uploads
builds
artifacts
pages
lfs
terraform states (!?)
packages
ci secure files
Container registry: same, currently in object storage without backups
Configuration files: backed up by bacula, assumed safe
Other data: mainly redis for the job queue and elastic search for the advanced search; we don't use the latter, and we could probably live without the former
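to make the repository-corruption risk above a bit more concrete: after a disaster recovery from bacula, a check along these lines would at least detect partial writes. this is a minimal sketch; the repository root is an assumption and should be adjusted to wherever gitaly stores repositories on this server:

```sh
#!/bin/sh
# post-restore sanity check: run "git fsck" on every bare repository restored
# from bacula, to catch partial writes (missing or truncated objects, bad refs).
# the repository root is an assumption, adjust for this server's gitaly storage.
REPO_ROOT=/srv/gitlab/git-data/repositories

find "$REPO_ROOT" -type d -name '*.git' -prune -print | while read -r repo; do
    if ! git -C "$repo" fsck --full --no-dangling >/dev/null 2>&1; then
        echo "CORRUPT: $repo"
    fi
done
```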
in #40615 (closed) we ran out of space again, but this time only in the "artifacts" filesystem. we've somewhat recovered and implemented some expiration policies that will hopefully keep us running for a bit (but that remains to be seen).
Not directly related to backups of course, but I should point out that backups still double the space used on the server, and therefore take up valuable space that we could otherwise dedicate to our users:
in the above, /srv/gitlab-shared is only artifacts, which are not backed up on /srv/gitlab-backup. it's also unclear if we'd be able to actually restore this server from backups in an emergency, because of the possible inconsistencies between artifacts and backups...
i should also mention that we added another 30GB to that gitlab-shared partition, which brings us pretty close to not having enough space to rebuild a new filesystem on the side of the existing disk:
if we look at the Used column above, that adds up to 357G, so it's still *doable*, but we only have ~60G of slack, and if we end up with separate partitions, that slack would need to be split between them too. So not a trivial switch.
It generally seems like artifacts storage size is unbounded and only expected to grow. It might be that we can only fix this problem by addressing the "large storage" problem (#40478 (closed)) as well.
Finally, I should mention that I've so far assumed that we would reduce the above physical volume allocation (/dev/sdd) to free up 420G, but there's probably a simpler solution: just create a new disk from the Ganeti node and assign it to the instance, creating a clean filesystem on it. In fact, we could just create a completely new VM (running bullseye, why not!) with a normal LVM root filesystem and GitLab's /opt and /srv on ZFS (on SSD and HDD vdevs, respectively).
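for the record, the "new disk, clean filesystem" variant would be roughly along these lines; the instance name, disk size, pool and mountpoint are made up for illustration:

```sh
# on the Ganeti master: attach a fresh disk to the instance
# (instance name and size are illustrative)
gnt-instance modify --disk add:size=500g gitlab-02.torproject.org
gnt-instance reboot gitlab-02.torproject.org

# inside the instance: build a clean ZFS pool on the new disk and create a
# dataset to move GitLab's data onto (pool, dataset and mountpoint are assumptions)
zpool create gitlab-pool /dev/sdX
zfs create -o mountpoint=/srv/gitlab-new gitlab-pool/srv
```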
So maybe we're being silly by keeping space on that PV after all, and we should just not block on this for normal maintenance operations.
we stumbled upon this problem again (#40744 (closed)) and this time it was the repositories storage that exploded. somehow, /srv/gitlab-backup/repositories held 280GB of repositories even though the original is still less than 100GB (!?). this is possibly related to the incremental backup mechanism introduced in 14.10. the fix was to exclude repositories from the gitlab rake job, as they are already backed up by bacula anyway.
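for reference, that exclusion is just the SKIP variable on the rake task, something like:

```sh
# skip the components bacula already covers; the SQL dump (and the rest)
# is still produced by the rake task
gitlab-backup create SKIP=repositories,artifacts
```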
more broadly, taking a step back here, i think filesystem snapshots would give us consistency between the different components, but the backup script is mostly useful because it gives us a plain SQL copy of the postgresql database. everything else is just files, and while those can end up somewhat inconsistent, that shouldn't be that big of a deal in a disaster recovery situation. a corrupt PostgreSQL database, however, is another story, although I bet it could recover from a half-assed bacula backup somewhat correctly.
a guess, however, is not as good as a SQL dump, so i'm keeping that backup script for now, even though it still duplicates a lot of data.
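to be explicit about what that SQL copy buys us: the db/ part of the rake backup is essentially a plain pg_dump, something along these lines on an Omnibus install (the paths, user and database name are the usual Omnibus defaults, the output file is arbitrary, all worth double-checking on our setup):

```sh
# dump the GitLab database to plain SQL, independently of the rake task;
# gitlab-psql, the socket directory and gitlabhq_production are Omnibus
# defaults, the output path is arbitrary
sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_dump \
    -h /var/opt/gitlab/postgresql gitlabhq_production \
    | gzip > /srv/gitlab-backup/manual-db-$(date +%F).sql.gz
```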
in #41402 (closed), I noticed the backup partition is virtually empty, so this has become more urgent: it seems these backups did get rethought, but accidentally, which is not really "thinking" at all.
```
root@gitlab-02:/srv/gitlab-backup# du -sch * | sort -h
4.0K    backup_information.yml
4.0K    ci_secure_files.tar.gz
4.0K    registry.tar.gz
4.0K    terraform_state.tar.gz
3.5M    builds.tar.gz
353M    lfs.tar.gz
1.1G    1700030933_2023_11_15_16.5.1_gitlab_backup.tar
1.1G    1700228908_2023_11_17_16.5.2_gitlab_backup.tar
1.1G    db
3.1G    uploads.tar.gz
5.1G    packages.tar.gz
38G     pages.tar.gz
50G     total
```
mysteriously, the .tar files only contain the database:
```
root@gitlab-02:/srv/gitlab-backup# ls -alhS
total 49G
-rw------- 1 git  git   38G Nov 22 00:39 pages.tar.gz
-rw------- 1 git  git  5.1G Nov 22 00:43 packages.tar.gz
-rw------- 1 git  git  3.1G Nov 22 00:07 uploads.tar.gz
-rw------- 1 git  git  1.1G Nov 17 13:51 1700228908_2023_11_17_16.5.2_gitlab_backup.tar
-rw------- 1 git  git  1.1G Nov 15 06:51 1700030933_2023_11_15_16.5.1_gitlab_backup.tar
-rw------- 1 git  git  353M Nov 22 00:39 lfs.tar.gz
-rw------- 1 git  git  3.5M Nov 22 00:07 builds.tar.gz
drwx------ 3 git  git  4.0K Nov 22 00:43 .
drwxr-xr-x 6 root root 4.0K Jun 28  2022 ..
drwx------ 2 git  git  4.0K Nov 22 00:02 db
-rw------- 1 git  git  3.5K Nov 22 00:43 ci_secure_files.tar.gz
-rw------- 1 git  git   466 Nov 22 00:43 backup_information.yml
-rw------- 1 git  git   153 Nov 22 00:39 registry.tar.gz
-rw------- 1 git  git   149 Nov 22 00:39 terraform_state.tar.gz
```
i have verified that those files in /srv/gitlab-backup are picked up by bacula, so that's good.
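(the check itself is just a matter of asking bacula; a hedged example, with the job name and jobid made up for illustration:)

```sh
# from the director: what would a backup job pick up right now, and what did a
# past job actually save? job name and jobid are made up for illustration
echo 'estimate job="gitlab-02.torproject.org" listing' | bconsole | grep gitlab-backup
echo 'list files jobid=12345' | bconsole | grep gitlab-backup
```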
i guess that most of the space originally used in that backup partition was taken by artifacts, and those have been out of the backups for years at this point (since #40517 (closed)). at least disk usage has been stable at around 20GB for a year, rising recently, probably because of #41402 (closed), although that would be strange, precisely because we're not backing up artifacts in that partition.
i'm not sure what to do about this ticket. maybe we could go through each of the tar files above and review whether they need to be individually backed up by gitlab or whether the bacula backups are sufficient.
at the very least the database backups are required here, but that's covered by gitlab#20 (moved).
fundamentally, the question here is what needs to be backed up as per the upstream fine manual. upstream says:
PostgreSQL databases: covered by current backup script, could be externalized (gitlab#20 (moved))
Git repositories: not covered by the current script; assumed bacula works, but this is actually problematic: gitaly needs to be stopped before backups can be performed consistently, so we either need to do that (!) or re-enable the script... there's actually a contradiction in the documentation about this, and i filed a ticket. it looks like the solution here is to use object storage to do server-side repository backups
Blobs: currently on disk, assumed to be safe to back up with bacula, but also covered by the rake task; could be moved to object storage and rely on that for backups
Container registry: same, currently in object storage without backups
Configuration files: backed up by bacula, assumed safe
Other data: mainly redis for the job queue and elastic search for the advanced search; we don't use the latter, and we could probably live without the former
So, i think the immediate fix here is to switch Gitaly's backup system to object storage. this can be done even though the object storage itself doesn't have backups, i think, because in this case the object storage is the backup.
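from a quick read of the upstream docs, the server-side flavour looks roughly like this; the bucket URL is made up and the exact configuration keys should be checked against the documentation for our GitLab version:

```sh
# 1. in /etc/gitlab/gitlab.rb, point gitaly at an object storage bucket for
#    server-side repository backups (keys and bucket name are assumptions):
#
#      gitaly['configuration'] = { backup: { go_cloud_url: 's3://gitlab-repository-backups' } }
#
#    then apply with:
gitlab-ctl reconfigure

# 2. tell the rake task to delegate repository backups to gitaly:
gitlab-backup create REPOSITORIES_SERVER_SIDE=true
```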
i opened #41425 (closed) to move backups to object storage, so that's covered. what remains here is the overlap between the rake backup task and our standard backups. we might be able to simply disable the backup job once postgresql backups are covered (#41426 (closed)).
so that is the next step: go through all components covered by the rake task and retire it once it is confirmed they are covered by another backup system. at this point, it is believed gitaly (#41425 (closed)) and postgresql (#41426 (closed)) need to be moved, but another check needs to be done before the retirement.
That kind of failed miserably. So we're back to square one on this one, hoping to get a response in https://gitlab.com/gitlab-org/gitlab/-/issues/432743 that will reassure us that bacula can just do the job for now.