reduce alert fatigue in Prometheus
I looked at reducing alerting noise before the break (#42221 (closed)) and found that we have way more alerts than i would like. It's time for a cleanup.
Steps to reproduce
look in #tor-alerts
What is the current bug behavior?
see too much noise, and miss some critical alerts (e.g. "gitlab will run out of disk space!")
What is the expected correct behavior?
that's an excellent question: what should our baseline be?
i'd like to have a couple of alerts a week, at most.
right now, we seem to have alerts every day, with a median of 14 alerts per day, and one day with 669 alerts!
so we should look at bringing that down. we should also look at reducing alert floods: 600+ alerts in a single day is unacceptable.
When did this start?
unclear. this likely dates from the (near) completion of %TPA-RFC-33-B: Prometheus server merge, more exporters, where most metrics were imported from nagios.
Relevant logs and/or screenshots
extract from #42221 (comment 3218751)
we only keep one day of logs for that service (!!), so here's the past 24h:
```
root@hetzner-nbg1-01:~# journalctl -u tpa_http_post_dump.service --since 2025-06-20 -o cat | grep ^'{' | jq -r '.alerts[] | select(.status == "firing") | "\( .labels.alertname ) "' | sort | uniq -c | sort
      1 DeadMansSwitch
      1 IncrementalBackupTooOld
      1 OutdatedLibraries
      1 PackagesPendingTooLong
      1 PuppetAgentErrors
      2 DRBDDegraded
      2 DjangoExceptions
      2 JobDown
      2 PgArchiverAge
      3 HTTPSUnreachable
      3 UnexpectedReboot
     16 HTTPSResponseDelayExceeded
    157 SystemdFailedUnits
```
looking at my irc logs, i see this, for about the past month:
```
anarcat@angela:~/s/t/prometheus-alerts> grep ALERTOR alertorlog | sed 's/.*tor-alerts- //;s/ .*//' | sort | uniq -c | sort -n
      1 2025-06-05
      1 2025-06-13
      1 ApacheDown
      1 PgLegacyBackupsStale
      2 NodeTextfileCollectorErrors
      2 PgBackRestStaleFullBackups
      3 PgLegacyBackupsFailures
      3 PuppetAgentErrors
      4 DRBDDegraded
      4 PgBackRestRepositoryError
      4 PlaintextHTTPUnreachable
      4 SSHUnreachable
      5 PgBackRestStanzaError
      7 IncrementalBackupTooOld
      9 HostDown
      9 PgArchiverFailed
     10 DiskWillFillSoon
     10 PgArchiverAge
     18 NeedsReboot
     24 JobDown
     37 HTTPSUnreachable
     66 DjangoExceptions
     68 UnexpectedReboot
    304 SystemdFailedUnits
    707 HTTPSResponseDelayExceeded
```
in that time, we had a lot of noise:
```
anarcat@angela:~/s/t/prometheus-alerts> grep ALERTOR alertorlog | awk '{print $1}' | sort | uniq -c | datamash -W --header-out --output-delimiter="|" count 1 sum 1 min 1 max 1 median 1 mean 1 sstdev 1
count(field-1)|sum(field-1)|min(field-1)|max(field-1)|median(field-1)|mean(field-1)|sstdev(field-1)
23|1304|2|669|14|56.695652173913|138.11702376233
```
as a md table:
| count(field-1) | sum(field-1) | min(field-1) | max(field-1) | median(field-1) | mean(field-1) | sstdev(field-1) |
|----------------|--------------|--------------|--------------|-----------------|-----------------|-----------------|
| 23             | 1304         | 2            | 669          | 14              | 56.695652173913 | 138.11702376233 |

that is, out of 24 days, we had 23 days with alerts and, out of those 23 days, we had a median of 14 alerts per day!
not great. really not great.
i'm thinking we should have an issue specifically for reducing that noise going forward. it seems like a quarter of the alerts are SystemdFailedUnits, so those could be a good first target. maybe aggregate them across the fleet?
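to give an idea, here's a minimal Alertmanager route sketch (not our actual config, and the `irc` receiver name is made up) that groups firing alerts by alert name, so a fleet-wide SystemdFailedUnits flood arrives as a single notification instead of one per host:

```yaml
# sketch only: our real Alertmanager route almost certainly differs
route:
  receiver: irc            # hypothetical receiver name
  group_by: ['alertname']  # one notification per alert name, not per instance
  group_wait: 5m           # wait for the flood to accumulate before sending
  group_interval: 4h       # throttle updates for an already-firing group
```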
HTTPSResponseDelayExceeded accounts for about half of the alerts. maybe those thresholds could be raised directly? interestingly, those delays are spread across the fleet pretty uniformly:
```
anarcat@angela:~/s/t/prometheus-alerts> grep ALERTOR alertorlog | grep Delay | grep -o '[^ ]*torproject.org[^ ]*' | sed 's,.*//,,;s,/.*,,' | sort | uniq -c | sort -n
      4 lox-test.torproject.org
      6 lox.torproject.org
      7 dev.crm.torproject.org
      8 bridges-email.torproject.org
      8 gitaly-02.torproject.org
      8 test.crm.torproject.org
     13 crm.torproject.org
     15 tb-build-03.torproject.org
     16 rdsys-test-01.torproject.org
     18 onionoo-backend-03.torproject.org
     19 metrics-api.torproject.org
     20 vault.torproject.org
     21 metrics-db.torproject.org
     23 anonticket.torproject.org
     24 tagtor.torproject.org
     27 donate.torproject.org
     28 containers.torproject.org
     30 lists.torproject.org
     30 tb-build-06.torproject.org
     31 dal-rescue.torproject.org
     34 btcpayserver-02.torproject.org
     35 staging.crm.torproject.org
     36 survey.torproject.org
     39 build-sources.tbb.torproject.org
     43 tb-build-02.torproject.org
     50 forum.torproject.org
     60 gitlab.torproject.org
```
i'm ready to bet that most of those servers are across the ocean too.
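as for raising the thresholds, here's a hypothetical rule sketch of what that could look like, assuming the alert is built on the blackbox exporter's `probe_duration_seconds` (the numbers and job label are made up; the real rule lives in the prometheus-alerts repository):

```yaml
# sketch only: assumes a blackbox exporter HTTPS probe; thresholds are made up
groups:
  - name: blackbox
    rules:
      - alert: HTTPSResponseDelayExceeded
        # raise the threshold and require it to hold for a while, so
        # transient transatlantic slowness doesn't fire an alert
        expr: probe_duration_seconds{job="blackbox_https"} > 5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "HTTPS probe for {{ $labels.instance }} is slow"
```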
Possible fixes
A good first step would be to keep more history: the dump script only outputs to stderr, which ends up in journald, which has a retention policy that varies with disk usage. From what I could tell, the JSON is about 300KB per day, so we could keep a month of samples in about 10MB. That would mean patching the python script to write to a file instead of stderr, and possibly setting up (daily?) log rotation.
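as a rough sketch of what that patch could look like (assuming the script can use the standard library's `logging` module; the path and retention below are made up):

```python
# sketch only: write alert dumps to a daily-rotated file instead of stderr;
# the path, rotation schedule and retention are guesses
import logging
from logging.handlers import TimedRotatingFileHandler

handler = TimedRotatingFileHandler(
    "/var/log/tpa_http_post_dump/alerts.json.log",  # hypothetical path
    when="midnight",  # rotate daily
    backupCount=30,   # keep about a month (~10MB at ~300KB/day)
)
handler.setFormatter(logging.Formatter("%(message)s"))  # raw JSON lines, no prefix

logger = logging.getLogger("tpa_http_post_dump")
logger.setLevel(logging.INFO)
logger.addHandler(handler)


def dump_alert(payload_json: str) -> None:
    """Log one JSON alert payload as a single line."""
    logger.info(payload_json)
```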
Obviously, once we have better metrics, we should look at reducing the noise. In #42221 (comment 3218751), i identified latency (threshold raised in prometheus-alerts@b68a52a9) and systemd units as the main culprits. We also have lots of "unexpected reboots" alerts which, i dunno, maybe we could just ditch entirely?
`NeedsReboot` also shows up in there and could be demoted to info.
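demoting it could be as simple as changing the severity label on the rule (sketch only: the expression below is a placeholder, i haven't checked how the actual rule is written), and then routing `severity: info` alerts somewhere quieter in Alertmanager:

```yaml
# sketch only: placeholder expression; the real NeedsReboot rule may differ
- alert: NeedsReboot
  expr: node_reboot_required > 0  # placeholder
  labels:
    severity: info  # demoted so it no longer makes noise in #tor-alerts
  annotations:
    summary: "{{ $labels.instance }} needs a reboot"
```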
ideally, each alert should be something we need to act on, not a "bah, puppet is making noise again" or "of course gitlab is slow, duh". that requires tweaking thresholds, but it might also require rearchitecting some things.
@lelutin i assigned this to you because you worked so much on prom and might have other ideas to contribute, but no pressure: this is not assigning fault, just your field of competence. :) i'm happy to take the issue and/or contribute!