disk space was fairly constant until june, and then it started increasing semi-linearly. we've gained 30GB since then, with only ~8GB left. we lose about 2GB at each compaction run, and those run roughly every day, so it's unlikely the remaining space will last until monday.
last time this happened (#40840 (closed)) we just silenced the warning; we can't get away with that this time...
at the very least we can give it another 20 gigs of space if we want to wait until tuesday (monday is a holiday) to brainstorm in the TPA sync.
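for the record, a rough way to sanity-check that kind of estimate is a `predict_linear` query over the filesystem metrics; the instance and mountpoint labels below are guesses and may need adjusting to what our node_exporter actually exports:

```
# does the linear trend over the last week predict the prometheus filesystem
# running out of space within the next 4 days?
predict_linear(
  node_filesystem_avail_bytes{instance=~"hetzner-nbg1-01.*", mountpoint="/var/lib/prometheus"}[1w],
  4 * 24 * 3600
) < 0
```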
> disk space was fairly constant until june, and then it started increasing semi-linearly
do we know why? is that space being taken up by the archives of old data? in that case, i'd suggest we start archiving data that's older than 6 months to a year
> do we know why? is that space being taken up by the archives of old data? in that case, i'd suggest we start archiving data that's older than 6 months to a year
i don't know, and i'm not sure how to tell. worse, i'm not sure there's a way to "archive" data in prometheus... there are complex multi-server setups where a second server downsamples data from the first, but we're not there yet. see also #40330.
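to be clear, the closest thing stock prometheus has is federation: a second, long-term server scrapes a subset of series from the first one's `/federate` endpoint at a much coarser interval. that's not real downsampling, but it's the kind of two-server setup i mean. a rough sketch of the scrape config on the hypothetical second server (port, match expression and interval are made up):

```yaml
# scrape_configs entry on a hypothetical second "long-term" prometheus server
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 5m            # much coarser than the primary's interval
    honor_labels: true             # keep the original instance/job labels
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'           # only pull the series we want to keep long-term
    static_configs:
      - targets:
          - 'hetzner-nbg1-01.torproject.org:9090'   # the existing (primary) server, assumed port
```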
this is a picture of the prometheus dashboard showing specifically the disk space used by the prometheus directory itself. you'll notice there's a correlation between the time it starts going up linearly and some disturbances in the graph. two, in fact: first the color changes in the graph, which means the internal labels on the data series changed. but then another line showed up, and that one is stable at 106GB.
that said, they don't correlate exactly with the beginning of the rise: the rise seems to have started at the end of May while the new line showed up in early May, and the color change happened even before that.
one thing though: we do have more servers than we had before. 12 months ago, we had 86 instances according to prometheus. now we have 96:
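(for reference, a count like that can be pulled straight out of prometheus with something along these lines:)

```
# number of distinct instances currently known to prometheus across all jobs
count(count by (instance) (up))
```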
you can also see that we started the bullseye upgrades around April (when the colors changed) and there was a big batch done in May as well (when the new line appeared). the server itself was upgraded in April (#40690 (comment 2793604)).
so it could be some metric that has had a cardinality explosion since the bullseye upgrade. that is the only theory i have so far, and it's a hard one to track down...
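one way to start tracking it down is to look at which metric names have the most series, either through the `/api/v1/status/tsdb` endpoint (if the prometheus version on the server has it) or with a query roughly like this:

```
# top 10 metric names by number of series; a cardinality explosion should stand out
# (heavy query on a large server, run it sparingly)
topk(10, count by (__name__) ({__name__=~".+"}))
```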
> but then another line showed up, and that one is stable at 106GB.
i think this is most likely happening because a cron job that's supposed to update that metric is not running. that line is the /var/lib/prometheus disk usage on the server, and it's definitely not at a steady 106GB:
    root@hetzner-nbg1-01:~# du -sh /var/lib/prometheus/
    137G    /var/lib/prometheus/
at 2023-02-10 19:14:06UTC, i deployed a tweak in puppet to disable the logind and systemd collectors. this should reduce the disk usage from those metrics as puppet rolls the change out everywhere.
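for the record, node_exporter collectors are toggled with `--collector.<name>` / `--no-collector.<name>` flags, so on a stock debian install the change would boil down to something like the following (the real flags are templated by puppet, and if the collectors were explicitly enabled it's rather a matter of dropping the `--collector.*` flags):

```shell
# /etc/default/prometheus-node-exporter: sketch of the equivalent manual change,
# not the actual puppet-managed file
ARGS="--no-collector.logind --no-collector.systemd"
```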
> disk space was fairly constant until june, and then it started increasing semi-linearly. we've gained 30GB since then, with only ~8GB left. we lose about 2GB at each compaction run, and those run roughly every day, so it's unlikely the remaining space will last until monday.
another important point: we don't actually have a compaction every day. that was me misreading the graph. compactions happen over weeks: the last one was on 2023-01-23 and the one before that on 2023-01-02, so about three weeks apart. if that is right, we should be compacting again in a few days, and since we only gain about 8GB over a full compaction cycle, it's unlikely we'll eat the remaining 8GB of disk in the few days before then.
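(one way to double-check those dates, assuming the debian layout where the TSDB lives under /var/lib/prometheus/metrics2, is to look at each block's meta.json; the blocks with higher compaction levels are the big periodic merges that show up as cliffs in the graph:)

```shell
# print each TSDB block's ULID, time range and compaction level;
# higher levels correspond to the big periodic compactions
for b in /var/lib/prometheus/metrics2/01*/; do
    jq -r '[.ulid, (.minTime/1000|floor|todate), (.maxTime/1000|floor|todate), .compaction.level] | @tsv' "${b}meta.json"
done
```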
looking at the disk usage graphs since the puppet deployment (2023-02-10 19:14:06UTC, plus 4-6h for puppet to run everywhere, so until 23:14UTC or 2023-02-11 01:14UTC), we see disk usage has stopped growing:
in the above graph, between 2023-02-11 12:00UTC and 2023-02-12 01:00UTC, we seem to be going upwards, gaining more free space as time goes forward. then we have this huge cliff that's the ~monthly compaction, and then disk space is more or less steady.
it's still on a downward trend, however. it's hard to tell from the graph, but it looks like free space is still slowly shrinking:
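(to put a number on that instead of eyeballing the graph, a `deriv()` over the free-space metric works; same caveat as in the predict_linear sketch above that the exact labels are guesses:)

```
# average change in free space on the prometheus filesystem over the last 3 days,
# converted to GB per day; negative means we're still losing space
deriv(node_filesystem_avail_bytes{instance=~"hetzner-nbg1-01.*", mountpoint="/var/lib/prometheus"}[3d]) * 86400 / 1e9
```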