bungei is filling up again
bungei has been flapping a little in our monitoring:
20:02:51 -ALERTOR1:#tor-alerts- DiskWillFillSoon [firing] Disk /srv/backups/bacula on bungei.torproject.org is almost full
09:12:52 -ALERTOR1:#tor-alerts- DiskWillFillSoon [resolved] Disk /srv/backups/bacula on bungei.torproject.org is almost full
Steps to reproduce
just stare at monitoring, it's right there! ;)
What is the current bug behavior?
our backups may fill up.
What is the expected correct behavior?
we shouldn't have to worry about this for the next year.
When did this start?
the alert happened today, but of course disk usage is constantly growing all around.
Relevant logs and/or screenshots
so. this is where i get to nerd out on grafana, check this out.
this is the disk usage of /srv/backups/bacula on bungei in the past year:
https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&from=now-1y&to=now&refresh=auto&var-class=%24__all&var-instance=bungei.torproject.org&timezone=utc&var-Filters=mountpoint%7C%3D%7C%2Fsrv%2Fbackups%2Fbacula&viewPanel=panel-4-clone-0
a couple interesting things here:
- our "diff" is 19.2TiB, which means we gained 19.2TiB in the past 12 months: about 1.6TiB of growth per month!
- we have funky spikes in there: those spikes are 7TiB tall! they don't quite happen monthly: they happened 9 times in the past 12 months, and are the main cause of this alert
- if we ignore that spike (19.2TiB - 7TiB), we naturally have about 12TiB of growth in the past 12 months, so roughly 1TiB/mth
- at this rate, we could still survive 11 months if we ignore the spike
- if we don't, we can still survive 5 months, but we'll get lots of noise from monitoring
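for the record, here's the back-of-the-envelope math behind those numbers as a small python sketch. careful: the free space figure is not something i read off the graph, it's an assumption (~11TiB) inferred from "11 months at 1TiB/mth". the variable and function names are made up for illustration, and the two ways of accounting for the spikes are just rough guesses at how to model them, not how monitoring computes its prediction.

```python
# back-of-the-envelope runway math for /srv/backups/bacula on bungei.
# DIFF_TIB and SPIKE_TIB come from the grafana graph; FREE_TIB is an
# ASSUMPTION inferred from "11 months at 1TiB/mth", not a measurement.

DIFF_TIB = 19.2   # growth over the past 12 months, spike included
SPIKE_TIB = 7.0   # height of one spike
MONTHS = 12
FREE_TIB = 11.0   # assumed free space (hypothetical, see above)


def monthly_growth(total_tib, months=MONTHS):
    """average growth in TiB per month over the period."""
    return total_tib / months


def runway_months(free_tib, rate_tib_per_month):
    """months left before free_tib is used up at a constant rate."""
    return free_tib / rate_tib_per_month


raw_rate = monthly_growth(DIFF_TIB)                    # about 1.6 TiB/mth
baseline_rate = monthly_growth(DIFF_TIB - SPIKE_TIB)   # about 1 TiB/mth

# ignoring the spike entirely
no_spike = runway_months(FREE_TIB, baseline_rate)

# two rough ways to account for the spike:
# 1. keep one spike's worth of headroom free at all times
with_headroom = runway_months(FREE_TIB - SPIKE_TIB, baseline_rate)
# 2. just use the raw growth rate, spike included
with_raw_rate = runway_months(FREE_TIB, raw_rate)

print(f"raw growth:       {raw_rate:.1f} TiB/mth")
print(f"baseline growth:  {baseline_rate:.1f} TiB/mth")
print(f"runway, no spike: {no_spike:.1f} months")
print(f"runway, headroom: {with_headroom:.1f} months")
print(f"runway, raw rate: {with_raw_rate:.1f} months")
```

under that assumption, the two spike-aware estimates land on either side of the 5 months figure above, so it's in the right ballpark, but all of this moves if the real free space is different.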
here's our growth in overall fleet disk usage in the past year, if we exclude bungei, director, and backup-storage-01 (aka "the backup system"):
and this is a stupidly long URL because grafana can't do excludes.
there we have +18TiB of growth over the fleet, which is in line with the backup growth above, so the growth on bungei is expected / normal.
Possible fixes
we have this shiny new backup-storage-01 server sitting at quintex with 34TiB of unallocated LVM space:
https://grafana.torproject.org/d/f7887271-1a77-4138-ad16-28be8b0ad0ab/lvm-disk-usage?orgId=1&from=now-7d&to=now&timezone=browser&var-class=role%3A%3Abackup%3A%3Astorage&var-class=role%3A%3Abackup%3A%3Astorage2024&var-vg_name=%24__all&var-instance=%24__all
so maybe one solution would be to finally start migrating some backups to backup-storage-01 in the next weeks/months?