hetzner-nbg1-01 / prometheus1 running out of disk space
back in May 2020, we had disk space issues with the prometheus server, at which point we doubled its disk, from 80GB to 160GB:
https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-4-prometheus-disk
since then, things have been mostly quiet, but we started receiving warnings from Icinga again recently:
Sun. 11:09 [1/1] nagios@hetzner-hel1-01.torproject.org ** RECOVERY Service Alert: hetzner-nbg1-01/disk usage on / is OK ** (nagios rapports tor)
July 18 [1/1] nagios@hetzner-hel1-01.torproject.org ** PROBLEM Service Alert: hetzner-nbg1-01/disk usage on / is WARNING ** (nagios rapports tor)
July 17 [1/1] nagios@hetzner-hel1-01.torproject.org ** PROBLEM Service Alert: hetzner-nbg1-01/disk usage on / is WARNING ** (nagios rapports tor)
July 17 [1/1] nagios@hetzner-hel1-01.torproject.org ** RECOVERY Service Alert: hetzner-nbg1-01/disk usage on / is OK ** (nagios rapports tor)
July 17 [1/1] nagios@hetzner-hel1-01.torproject.org ** PROBLEM Service Alert: hetzner-nbg1-01/disk usage on / is WARNING ** (nagios rapports tor)
July 17 [1/1] nagios@hetzner-hel1-01.torproject.org ** RECOVERY Service Alert: hetzner-nbg1-01/disk usage on / is OK ** (nagios rapports tor)
July 17 [1/1] nagios@hetzner-hel1-01.torproject.org ** PROBLEM Service Alert: hetzner-nbg1-01/disk usage on / is WARNING ** (nagios rapports tor)
July 04 [1/1] nagios@hetzner-hel1-01.torproject.org ** RECOVERY Service Alert: hetzner-nbg1-01/disk usage on / is OK ** (nagios rapports tor)
July 02 [1/1] nagios@hetzner-hel1-01.torproject.org ** PROBLEM Service Alert: hetzner-nbg1-01/disk usage on / is WARNING ** (nagios rapports tor)
July 02 [1/1] nagios@hetzner-hel1-01.torproject.org ** PROBLEM Service Alert: hetzner-nbg1-01/disk usage on / is WARNING ** (nagios rapports tor)
July 02 [1/1] nagios@hetzner-hel1-01.torproject.org ** RECOVERY Service Alert: hetzner-nbg1-01/disk usage on / is OK ** (nagios rapports tor)
July 02 [1/1] nagios@hetzner-hel1-01.torproject.org ** PROBLEM Service Alert: hetzner-nbg1-01/disk usage on / is WARNING ** (nagios rapports tor)
this is what it looks like in Grafana:
it's a typical "sawtooth graph" where disk usage increases regularly, then drops when prometheus compacts its data. there's a worrisome downwards trend in available space however, and i am not exactly sure what it's due to. maybe we're rotating machines and it's duplicating records? i often see discontinuities in colors in grafana as well, which are presumably due to label changes, maybe those could have an impact?
we have used ~25GiB in the last year (51.9GiB - 27.4GiB = 24.5GiB, AKA "max available - min available"), and currently have 32.4GiB left (although we should probably look at the min of 27.4GiB there instead). at that rate, 27.4GiB lasts a bit over a year (27.4GiB / 24.5GiB per year ≈ 1.1 years), so we should have at least another year to deal with this before an outage, assuming this is a linear increase.
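for reference, here is a minimal sketch of that estimate in python. it only uses the numbers quoted above, and it assumes growth stays linear and that the observed period is exactly one year. the variable names are made up for illustration.

```python
# Rough linear extrapolation of remaining disk headroom.
# Assumptions: growth is linear and the figures cover exactly one year.
max_available_gib = 51.9  # highest "available" value seen over the last year
min_available_gib = 27.4  # lowest "available" value seen over the last year

# Space consumed over the observed year ("max available - min available").
yearly_growth_gib = max_available_gib - min_available_gib

# Pessimistic estimate: start from the lowest available value,
# not from the current (post-compaction) one.
years_left = min_available_gib / yearly_growth_gib

print(f"growth: {yearly_growth_gib:.1f} GiB/year")
print(f"time left: {years_left:.2f} years (about {years_left * 12:.0f} months)")
```

using the minimum rather than the current value is deliberate: the sawtooth peaks are what would actually trigger an outage.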
note that we have already tweaked the reserved block count on this server, so that's not an option:
root@hetzner-nbg1-01:~# dumpe2fs /dev/sda1 | grep -i 'block count'
dumpe2fs 1.46.2 (28-Feb-2021)
Block count: 40000251
Reserved block count: 100001
anarcat@curie:~$ echo 100001/40000251 | qalc
> 100001/40000251
100001 / 40000251 = approx. 0,0025000093
anarcat@curie:~$ echo '100001 * 4096 bytes' | qalc
> 100001 * 4096 bytes
100001 * (4096 * byte) = approx. 409,6041 megabytes
i.e. that's 0.25%, or about 410MB, which is probably enough, and lowering that won't buy us much time anyways.
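the reserved-block arithmetic above can also be double-checked without qalc. this is just a sketch: the block counts come from the dumpe2fs output, and the 4096-byte block size is an assumption carried over from the qalc call (it is not shown in the grep output).

```python
# Sanity check of the reserved block figures from dumpe2fs.
block_count = 40000251     # "Block count" from dumpe2fs
reserved_blocks = 100001   # "Reserved block count" from dumpe2fs
block_size = 4096          # bytes; assumed, as in the qalc call above

reserved_ratio = reserved_blocks / block_count
reserved_bytes = reserved_blocks * block_size

print(f"reserved: {reserved_ratio:.4%} of the filesystem")
print(f"reserved: {reserved_bytes / 1000**2:.1f} MB ({reserved_bytes / 1024**2:.1f} MiB)")
```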