
reduce alert fatigue in Prometheus

I looked at reducing alerting noise before the break (#42221 (closed)) and found out we have way more alerts than I would like. It's time for a cleanup.

Steps to reproduce

look in #tor-alerts

What is the current bug behavior?

we see too much noise and miss some critical alerts (e.g. "gitlab will run out of disk space!")

What is the expected correct behavior?

that's an excellent question: what should our baseline be?

i'd like to have a couple of alerts a week, at most.

right now, we seem to have alerts every day, with a median of 14 alerts per day, and one day with as many as 669 alerts!

so we should look at reducing that. we should also look at curbing alert floods: more than 600 alerts in a single day is unacceptable.

When did this start?

unclear. this likely dates from the (near) completion of %TPA-RFC-33-B: Prometheus server merge, more exporters, when most metrics were imported from nagios.

Relevant logs and/or screenshots

extract from #42221 (comment 3218751)

we only keep one day of logs for that service (!!), so here's the past 24h:

root@hetzner-nbg1-01:~# journalctl -u tpa_http_post_dump.service --since 2025-06-20 -o cat  | grep ^'{' | jq -r '.alerts[] | select(.status == "firing") | "\( .labels.alertname ) "'  | sort | uniq -c | sort
      1 DeadMansSwitch 
      1 IncrementalBackupTooOld 
      1 OutdatedLibraries 
      1 PackagesPendingTooLong 
      1 PuppetAgentErrors 
      2 DRBDDegraded 
      2 DjangoExceptions 
      2 JobDown 
      2 PgArchiverAge 
      3 HTTPSUnreachable 
      3 UnexpectedReboot 
     16 HTTPSResponseDelayExceeded 
    157 SystemdFailedUnits 

looking at my irc logs, i see this, for about the past month:

anarcat@angela:~/s/t/prometheus-alerts> grep ALERTOR alertorlog  | sed 's/.*tor-alerts- //;s/ .*//' | sort | uniq -c | sort -n
      1 2025-06-05
      1 2025-06-13
      1 ApacheDown
      1 PgLegacyBackupsStale
      2 NodeTextfileCollectorErrors
      2 PgBackRestStaleFullBackups
      3 PgLegacyBackupsFailures
      3 PuppetAgentErrors
      4 DRBDDegraded
      4 PgBackRestRepositoryError
      4 PlaintextHTTPUnreachable
      4 SSHUnreachable
      5 PgBackRestStanzaError
      7 IncrementalBackupTooOld
      9 HostDown
      9 PgArchiverFailed
     10 DiskWillFillSoon
     10 PgArchiverAge
     18 NeedsReboot
     24 JobDown
     37 HTTPSUnreachable
     66 DjangoExceptions
     68 UnexpectedReboot
    304 SystemdFailedUnits
    707 HTTPSResponseDelayExceeded

in that time, we had a lot of noise:

anarcat@angela:~/s/t/prometheus-alerts> grep ALERTOR alertorlog  | awk '{print $1}' | sort | uniq -c | datamash -W --header-out --output-delimiter="|" count 1 sum 1 min 1 max 1 median 1 mean 1 sstdev 1
count(field-1)|sum(field-1)|min(field-1)|max(field-1)|median(field-1)|mean(field-1)|sstdev(field-1)
23|1304|2|669|14|56.695652173913|138.11702376233

as a md table:

| count | sum | min | max | median | mean | sstdev |
|-------|-----|-----|-----|--------|------|--------|
| 23 | 1304 | 2 | 669 | 14 | 56.695652173913 | 138.11702376233 |

that is, out of 24 days, we had 23 days with alerts and, out of those 23 days, we had a median of 14 alerts per day!

not great. really not great.

i'm thinking we should have an issue specifically for reducing that noise going forward. it seems like a quarter of the alerts are SystemdFailedUnits, so those could be a good first target. maybe aggregate them for the fleet?
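
as a rough illustration of what "aggregating for the fleet" could mean: instead of one alert per host, a single meta-alert could fire on the total count of SystemdFailedUnits across the fleet. the query below is just a sketch against the built-in ALERTS metric, with a placeholder Prometheus URL:

# sketch: how many SystemdFailedUnits alerts are firing fleet-wide right now;
# the Prometheus URL is a placeholder
curl -sG https://prometheus.example.org/api/v1/query \
  --data-urlencode 'query=count(ALERTS{alertname="SystemdFailedUnits", alertstate="firing"})' \
  | jq -r '.data.result[0].value[1]'

a single rule over an expression like that (with a sensible threshold) would turn a flood of per-host alerts into one actionable signal.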

HTTPSResponseDelayExceeded accounts for half of the alerts as well. maybe those thresholds could be raised directly (see the sketch after the per-host breakdown below)? interestingly, those delays are spread across the fleet pretty uniformly:

anarcat@angela:~/s/t/prometheus-alerts> grep ALERTOR alertorlog  | grep Delay | grep -o '[^ ]*torproject.org[^ ]*' | sed 's,.*//,,;s,/.*,,' | sort | uniq -c | sort -n
      4 lox-test.torproject.org
      6 lox.torproject.org
      7 dev.crm.torproject.org
      8 bridges-email.torproject.org
      8 gitaly-02.torproject.org
      8 test.crm.torproject.org
     13 crm.torproject.org
     15 tb-build-03.torproject.org
     16 rdsys-test-01.torproject.org
     18 onionoo-backend-03.torproject.org
     19 metrics-api.torproject.org
     20 vault.torproject.org
     21 metrics-db.torproject.org
     23 anonticket.torproject.org
     24 tagtor.torproject.org
     27 donate.torproject.org
     28 containers.torproject.org
     30 lists.torproject.org
     30 tb-build-06.torproject.org
     31 dal-rescue.torproject.org
     34 btcpayserver-02.torproject.org
     35 staging.crm.torproject.org
     36 survey.torproject.org
     39 build-sources.tbb.torproject.org
     43 tb-build-02.torproject.org
     50 forum.torproject.org
     60 gitlab.torproject.org

i'm ready to bet that most of those servers are across the ocean too.
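
before raising the HTTPSResponseDelayExceeded threshold, it might be worth checking what the probes actually measure per target. a rough sketch, assuming the blackbox exporter's probe_duration_seconds metric; the Prometheus URL and the job label are guesses:

# sketch: 95th percentile of probe duration per target over the past week,
# to see where a saner threshold would sit; URL and job label are hypothetical
curl -sG https://prometheus.example.org/api/v1/query \
  --data-urlencode 'query=quantile_over_time(0.95, probe_duration_seconds{job="blackbox_https"}[7d])' \
  | jq -r '.data.result[] | "\(.metric.instance) \(.value[1])"' \
  | sort -k2 -n

that would also confirm (or refute) the geography theory: the overseas targets should cluster at the slow end of that list.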

Possible fixes

A good first step would be to keep more history: the dump script only outputs to stderr, which ends up in journald, which has a varying retention policy depending on disk usage. From what I could tell, the JSON is about 300KB per day, so we could keep a month of samples in about 10MB. That would mean patching the python script to write to a file instead of stderr, and possibly setting up (daily?) log rotation.
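
as a stopgap (or an alternative to patching the script), systemd itself can append the unit's stderr to a file, and logrotate can keep a month of it. a rough sketch with made-up paths, relying on the StandardError=append: directive (systemd 240 and later); in practice this would go through puppet, of course:

# sketch: send the unit's stderr to a file instead of journald
mkdir -p /etc/systemd/system/tpa_http_post_dump.service.d
cat > /etc/systemd/system/tpa_http_post_dump.service.d/log-to-file.conf <<'EOF'
[Service]
StandardError=append:/var/log/tpa_http_post_dump.json
EOF
systemctl daemon-reload && systemctl restart tpa_http_post_dump.service

# sketch: rotate daily and keep a month, which should stay around 10MB
# (per the ~300KB/day estimate above)
cat > /etc/logrotate.d/tpa_http_post_dump <<'EOF'
/var/log/tpa_http_post_dump.json {
    daily
    rotate 31
    compress
    missingok
    notifempty
}
EOF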

Obviously, once we have better metrics, we should look at reducing the noise itself. In #42221 (comment 3218751), i identified latency (threshold raised in prometheus-alerts@b68a52a9) and systemd units as the main culprits. We also have lots of "unexpected reboots" alerts which, i dunno, maybe we could just ditch entirely?

NeedsReboot also shows up in there and could be demoted to info.
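
demoting NeedsReboot to severity info only helps if the alertmanager routing tree actually treats info differently (e.g. doesn't relay it to IRC). amtool can show where such an alert would land; a sketch, with a guessed config path:

# sketch: check which receiver an info-severity NeedsReboot alert would be routed to;
# the alertmanager config path is a guess
amtool config routes test --config.file=/etc/prometheus/alertmanager.yml \
  severity=info alertname=NeedsReboot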

ideally, each alert should be something we need to act on, not a "bah, puppet is making noises again" or an "of course gitlab is slow, duh". that requires tweaking thresholds, but it might also require rearchitecting some things.

@lelutin i assigned this to you because you worked so much on prom and might have other ideas to contribute, but no pressure: this is not assigning fault, just your field of competence. :) i'm happy to take the issue and/or contribute!
