bungei is filling up again

bungei has been flapping a little in our monitoring:

20:02:51 -ALERTOR1:#tor-alerts- DiskWillFillSoon [firing] Disk /srv/backups/bacula on bungei.torproject.org is almost full
09:12:52 -ALERTOR1:#tor-alerts- DiskWillFillSoon [resolved] Disk /srv/backups/bacula on bungei.torproject.org is almost full

Steps to reproduce

just stare at monitoring, it's right there! ;)

What is the current bug behavior?

our backups may fill up.

What is the expected correct behavior?

we shouldn't have to worry about this for the next year.

When did this start?

the alert happened today, but of course disk usage is constantly growing all around.

Relevant logs and/or screenshots

so. this is where i get to nerd out on grafana, check this out.

this is the disk usage of /srv/backups/bacula on bungei in the past year:

https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&from=now-1y&to=now&refresh=auto&var-class=%24__all&var-instance=bungei.torproject.org&timezone=utc&var-Filters=mountpoint%7C%3D%7C%2Fsrv%2Fbackups%2Fbacula&viewPanel=panel-4-clone-0

a couple interesting things here:

our "diff" is 19.2TiB, which means we gained 19.2TiB in the past 12 months: about 1.6TiB of growth per month!
we have funky spikes in there: those spikes are 7TiB tall! they don't quite happen monthly: they happened 9 times in the past 12 months, and are the main cause of this alert
if we ignore those spikes, the underlying growth is 12TiB in the past 12 months, so 1TiB/mth
at this rate, we could still survive 11 months if we ignore the spike
if we don't, we can still survive 5 months, but we'll get lots of noise from monitoring
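
for the record, here's a rough sketch of the arithmetic behind those estimates. the growth and spike figures are the ones quoted above; the free space value is a hypothetical placeholder (it's not stated in this ticket) and the function names are made up for illustration, so plug in the real number from grafana before trusting the output.

```python
def monthly_growth(total_growth_tib, months=12):
    """Average growth per month, in TiB, over the given period."""
    return total_growth_tib / months


def months_until_full(free_tib, growth_tib_per_month, spike_tib=0.0):
    """Months left before steady growth plus a transient spike uses up free space.

    The spike is treated as headroom that must stay available at all times.
    """
    headroom = free_tib - spike_tib
    if headroom <= 0:
        return 0.0
    return headroom / growth_tib_per_month


# figures quoted in this ticket
overall_rate = monthly_growth(19.2)   # spikes included, about 1.6 TiB/month
baseline_rate = monthly_growth(12.0)  # spikes ignored, 1 TiB/month
spike_tib = 7.0

# ASSUMPTION: placeholder value, not taken from monitoring
free_tib = 12.0

print(f"overall growth: {overall_rate:.1f} TiB/month")
print(f"baseline growth: {baseline_rate:.1f} TiB/month")
print(f"runway ignoring spikes: {months_until_full(free_tib, baseline_rate):.1f} months")
print(f"runway with spikes: {months_until_full(free_tib, baseline_rate, spike_tib):.1f} months")
```

the spike is modeled as space we need to keep free permanently, which is pessimistic but matches how the alert behaves: it fires whenever a spike eats into the remaining headroom.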

here's our growth in overall fleet disk usage in the past year, if we exclude bungei, director, and backup-storage-01 (aka "the backup system"):

and this is a stupidly long URL because grafana can't do excludes.

there we have a +18TiB growth over the fleet, so the 19.2TiB growth in backups is expected / normal.

Possible fixes

we have this shiny new backup-storage-01 server sitting at quintex with 34TiB of unallocated LVM space:

https://grafana.torproject.org/d/f7887271-1a77-4138-ad16-28be8b0ad0ab/lvm-disk-usage?orgId=1&from=now-7d&to=now&timezone=browser&var-class=role%3A%3Abackup%3A%3Astorage&var-class=role%3A%3Abackup%3A%3Astorage2024&var-vg_name=%24__all&var-instance=%24__all

so maybe one solution would be to finally start migrating some backups to backup-storage-01 in the next weeks/months?

Edited Sep 02, 2025 by anarcat