prometheus reboot alerts are extremely noisy on IRC

marked this issue as related to #41633 (closed)

changed milestone to %TPA-RFC-33-B: Prometheus server merge, more exporters

assigned to @lelutin

yeah most definitely. that output is annoying.

that's caused by alertmanager outputting everything again when the set of matching timeseries(hosts) changes, so when one host resolved all of the still not rebooted hosts are shown again.

we may want to check if something can be done on the alertmanager output level. otherwise, it's in the alertmanager-irc-relay that we'll want to look

Can't tell if this applies to your specific situation, but you may want to look into the Alertmanager group_wait option.

background docs on grouping: https://prometheus.io/docs/alerting/latest/alertmanager/#grouping

looking at group_wait in https://prometheus.io/docs/alerting/latest/configuration/ and i'm not sure it's the right thing: that tells the alertmanager to wait for other alerts before sending the notification. it's set to 30s by default.

group_interval, however, seems more appropriate: it tells the alertmanager to wait that amount of time before sending a new notification when an alert is added to a group which seems to be exactly what we're going through here. that's set to 5m here.

that said, the above noise was much more tightly spaced that 5min or even 30s intervals: it's basically dumping (and redumping) all hosts requiring a reboot all the time.

i think this is a group_by issue. right now we have this:

  group_by:
    - 'alertname'
    - 'cluster'
    - 'service'

which is a bit bizarre, because we don't have cluster or service labels at all. we have alertname (sure, that's built-in) but also team, severity defined in the rules, and in the metrics a lot more, like classes, instance, alias, job, and so on.

so perhaps we could group on classes, job or alias here?

cks has some noises about group_interval as well here https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusAlertsClearingTime

i don't think the alertmanager-irc-relay webhook is at play here. it looks pretty dumb: it just takes messages and dumps them over IRC, after some formatting. it doesn't do deduplication, and the only configuration i saw was a buffer size on the irc-relay side, and on alertmanager's side, the number of alerts to dump on there, which we don't customize (which is therefore "all alerts").

i think this is really in the alertmanager's config, possibly a grouping issue, thanks @georg for the pointer.

also note another oddity with notifications was spotted in #18, but i doubt it's related. fix could similarly be related to grouping, however.

mentioned in issue prometheus-alerts#18 (closed)

okay, i turned up debugging on alertmanager, and from the needrestart stats on perdulce, i was able to trigger a new notification on irc. i think it triggered because the set of alerts in the alerting group changed, which is an issue on its own.

but, interestingly, this triggered only one webhook notification:

Sep 20 00:25:51 hetzner-nbg1-01 prometheus-alertmanager[881389]: ts=2024-09-20T00:25:51.845Z caller=notify.go:743 level=debug component=dispatcher receiver=irc-tor-admin integration=webhook[0] msg="Notify success" attempts=1

so at least one part of the problem here is that one notification is triggering dozens of notifications on IRC. i suspect changing the way things are grouped in the route might help (group_by?)...

the other problem is that changes in the group trigger a new notification. that, i'm less sure how to fix.

one message that might be worth investigating is the repeat_interval stuff:

Sep 20 00:34:08 hetzner-nbg1-01 prometheus-alertmanager[882517]: ts=2024-09-20T00:34:08.105Z caller=main.go:498 level=warn component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=8760h0m0s retention=120h0m0s route={}
Sep 20 00:34:08 hetzner-nbg1-01 prometheus-alertmanager[882517]: ts=2024-09-20T00:34:08.105Z caller=main.go:498 level=warn component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=8760h0m0s retention=120h0m0s route="{}/{severity=\"critical\",team=\"TPA\"}"
Sep 20 00:34:08 hetzner-nbg1-01 prometheus-alertmanager[882517]: ts=2024-09-20T00:34:08.105Z caller=main.go:498 level=warn component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=8760h0m0s retention=120h0m0s route="{}/{severity=~\"critical|warning\",team=\"TPA\"}"
Sep 20 00:34:08 hetzner-nbg1-01 prometheus-alertmanager[882517]: ts=2024-09-20T00:34:08.105Z caller=main.go:498 level=warn component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=8760h0m0s retention=120h0m0s route="{}/{team=\"anti-censorship\"}"
Sep 20 00:34:08 hetzner-nbg1-01 prometheus-alertmanager[882517]: ts=2024-09-20T00:34:08.105Z caller=main.go:498 level=warn component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=8760h0m0s retention=120h0m0s route="{}/{team=\"network\"}"
Sep 20 00:34:08 hetzner-nbg1-01 prometheus-alertmanager[882517]: ts=2024-09-20T00:34:08.105Z caller=main.go:498 level=warn component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=8760h0m0s retention=120h0m0s route="{}/{team=~\"network-health|metrics\"}"

marked this issue as related to #40216

mentioned in issue #40216

so i dived even deeper into this, even though this is not a priority (oops), because it's so damn noisy and we had just investigated so much already.

i tweaked the irc relay config to enable the msg_once_per_alert_group flag, which, instead of sending one message per alert, sends one message per group. i then modified the template to show all the affected hosts on one "line" (which the bot splits up in multiple lines eventually).

the result is something that was like this:

21:04:36 -ALERTOR1:#tor-alerts- [warning:firing] Host alberti.torproject.org has processes using outdated libraries
21:04:36 -ALERTOR1:#tor-alerts- [warning:firing] Host anonticket-01.torproject.org has processes using outdated libraries
21:04:36 -ALERTOR1:#tor-alerts- [warning:firing] Host archive-01.torproject.org has processes using outdated libraries
21:04:36 -ALERTOR1:#tor-alerts- [warning:firing] Host backup-storage-01.torproject.org has processes using outdated libraries
21:04:39 -ALERTOR1:#tor-alerts- [warning:firing] Host bacula-director-01.torproject.org has processes using outdated libraries
21:04:42 -ALERTOR1:#tor-alerts- [warning:firing] Host btcpayserver-02.torproject.org has processes using outdated libraries
21:04:45 -ALERTOR1:#tor-alerts- [warning:firing] Host bungei.torproject.org has processes using outdated libraries
21:04:48 -ALERTOR1:#tor-alerts- [warning:firing] Host carinatum.torproject.org has processes using outdated libraries
21:04:51 -ALERTOR1:#tor-alerts- [warning:firing] Host cdn-backend-sunet-02.torproject.org has processes using outdated libraries
21:04:54 -ALERTOR1:#tor-alerts- [warning:firing] Host check-01.torproject.org has processes using outdated libraries
21:04:57 -ALERTOR1:#tor-alerts- [warning:firing] Host chives.torproject.org has processes using outdated libraries
21:05:00 -ALERTOR1:#tor-alerts- [warning:resolved] Host ci-runner-x86-02.torproject.org has processes using outdated libraries
21:05:03 -ALERTOR1:#tor-alerts- [warning:firing] Host ci-runner-x86-03.torproject.org has processes using outdated libraries
21:05:06 -ALERTOR1:#tor-alerts- [warning:firing] Host colchicifolium.torproject.org has processes using outdated libraries
21:05:09 -ALERTOR1:#tor-alerts- [warning:firing] Host collector-02.torproject.org has processes using outdated libraries
21:05:12 -ALERTOR1:#tor-alerts- [warning:firing] Host crm-ext-01.torproject.org has processes using outdated libraries
21:05:15 -ALERTOR1:#tor-alerts- [warning:firing] Host crm-int-01.torproject.org has processes using outdated libraries
21:05:17 -ALERTOR1:#tor-alerts- [warning:firing] Host dal-rescue-01.torproject.org has processes using outdated libraries
21:05:20 -ALERTOR1:#tor-alerts- [warning:firing] Host dal-rescue-02.torproject.org has processes using outdated libraries
21:05:23 -ALERTOR1:#tor-alerts- [warning:firing] Host dangerzone-01.torproject.org has processes using outdated libraries
21:05:26 -ALERTOR1:#tor-alerts- [warning:firing] Host donate-01.torproject.org has processes using outdated libraries
21:05:29 -ALERTOR1:#tor-alerts- [warning:firing] Host donate-review-01.torproject.org has processes using outdated libraries
21:05:32 -ALERTOR1:#tor-alerts- [warning:firing] Host forum-01.torproject.org has processes using outdated libraries
21:05:35 -ALERTOR1:#tor-alerts- [warning:firing] Host gitlab-02.torproject.org has processes using outdated libraries
21:05:38 -ALERTOR1:#tor-alerts- [warning:firing] Host henryi.torproject.org has processes using outdated libraries
21:05:41 -ALERTOR1:#tor-alerts- [warning:firing] Host hetzner-hel1-02.torproject.org has processes using outdated libraries
21:05:44 -ALERTOR1:#tor-alerts- [warning:firing] Host hetzner-hel1-03.torproject.org has processes using outdated libraries
21:05:47 -ALERTOR1:#tor-alerts- [warning:firing] Host hetzner-nbg1-01.torproject.org has processes using outdated libraries
21:05:50 -ALERTOR1:#tor-alerts- [warning:firing] Host hetzner-nbg1-02.torproject.org has processes using outdated libraries
21:05:53 -ALERTOR1:#tor-alerts- [warning:firing] Host loghost01.torproject.org has processes using outdated libraries
21:05:56 -ALERTOR1:#tor-alerts- [warning:firing] Host mandos-01.torproject.org has processes using outdated libraries
21:05:58 -ALERTOR1:#tor-alerts- [warning:firing] Host materculae.torproject.org has processes using outdated libraries
21:06:01 -ALERTOR1:#tor-alerts- [warning:firing] Host media-01.torproject.org has processes using outdated libraries
21:06:04 -ALERTOR1:#tor-alerts- [warning:firing] Host meronense.torproject.org has processes using outdated libraries
21:06:07 -ALERTOR1:#tor-alerts- [warning:resolved] Host metricsdb-01.torproject.org has processes using outdated libraries
21:06:10 -ALERTOR1:#tor-alerts- [warning:firing] Host neriniflorum.torproject.org has processes using outdated libraries
21:06:13 -ALERTOR1:#tor-alerts- [warning:firing] Host nevii.torproject.org has processes using outdated libraries
21:06:16 -ALERTOR1:#tor-alerts- [warning:firing] Host ns3.torproject.org has processes using outdated libraries
21:06:19 -ALERTOR1:#tor-alerts- [warning:firing] Host ns5.torproject.org has processes using outdated libraries
21:06:22 -ALERTOR1:#tor-alerts- [warning:firing] Host onionbalance-02.torproject.org has processes using outdated libraries
21:06:25 -ALERTOR1:#tor-alerts- [warning:firing] Host onionoo-backend-01.torproject.org has processes using outdated libraries
21:06:28 -ALERTOR1:#tor-alerts- [warning:firing] Host onionoo-backend-02.torproject.org has processes using outdated libraries
21:06:31 -ALERTOR1:#tor-alerts- [warning:firing] Host onionoo-backend-03.torproject.org has processes using outdated libraries
21:06:34 -ALERTOR1:#tor-alerts- [warning:firing] Host onionoo-frontend-01.torproject.org has processes using outdated libraries
21:06:37 -ALERTOR1:#tor-alerts- [warning:firing] Host onionoo-frontend-02.torproject.org has processes using outdated libraries
21:06:39 -ALERTOR1:#tor-alerts- [warning:firing] Host palmeri.torproject.org has processes using outdated libraries
21:06:45 -ALERTOR1:#tor-alerts- [warning:firing] Host polyanthum.torproject.org has processes using outdated libraries
21:06:47 -ALERTOR1:#tor-alerts- [warning:firing] Host probetelemetry-01.torproject.org has processes using outdated libraries
21:06:50 -ALERTOR1:#tor-alerts- [warning:firing] Host rdsys-frontend-01.torproject.org has processes using outdated libraries
21:06:53 -ALERTOR1:#tor-alerts- [warning:firing] Host rdsys-test-01.torproject.org has processes using outdated libraries
21:06:56 -ALERTOR1:#tor-alerts- [warning:firing] Host relay-01.torproject.org has processes using outdated libraries
21:06:59 -ALERTOR1:#tor-alerts- [warning:firing] Host ssh-dal-01.torproject.org has processes using outdated libraries
21:07:02 -ALERTOR1:#tor-alerts- [warning:firing] Host static-gitlab-shim.torproject.org has processes using outdated libraries
21:07:05 -ALERTOR1:#tor-alerts- [warning:firing] Host staticiforme.torproject.org has processes using outdated libraries
21:07:08 -ALERTOR1:#tor-alerts- [warning:firing] Host submit-01.torproject.org has processes using outdated libraries
21:07:11 -ALERTOR1:#tor-alerts- [warning:firing] Host survey-01.torproject.org has processes using outdated libraries
21:07:14 -ALERTOR1:#tor-alerts- [warning:firing] Host tb-build-02.torproject.org has processes using outdated libraries
21:07:17 -ALERTOR1:#tor-alerts- [warning:resolved] Host tb-build-06.torproject.org has processes using outdated libraries
21:07:20 -ALERTOR1:#tor-alerts- [warning:firing] Host tb-pkgstage-01.torproject.org has processes using outdated libraries
21:07:23 -ALERTOR1:#tor-alerts- [warning:firing] Host tb-tester-01.torproject.org has processes using outdated libraries
21:07:26 -ALERTOR1:#tor-alerts- [warning:firing] Host tbb-nightlies-master.torproject.org has processes using outdated libraries
21:07:29 -ALERTOR1:#tor-alerts- [warning:firing] Host vault-01.torproject.org has processes using outdated libraries
21:07:31 -ALERTOR1:#tor-alerts- [warning:firing] Host weather-01.torproject.org has processes using outdated libraries
21:07:34 -ALERTOR1:#tor-alerts- [warning:firing] Host web-dal-07.torproject.org has processes using outdated libraries
21:07:37 -ALERTOR1:#tor-alerts- [warning:firing] Host web-dal-08.torproject.org has processes using outdated libraries
21:07:42 -ALERTOR1:#tor-alerts- [warning:firing] Host web-fsn-01.torproject.org has processes using outdated libraries
21:07:45 -ALERTOR1:#tor-alerts- [warning:firing] Host web-fsn-02.torproject.org has processes using outdated libraries

turns into this:

23:29:36 -ALERTOR1:#tor-alerts- OutdatedLibraries (warning) for node is firing 73 alerts on archive-01.torproject.org backup-storage-01.torproject.org 
          bacula-director-01.torproject.org bungei.torproject.org carinatum.torproject.org check-01.torproject.org chives.torproject.org 
          ci-runner-x86-03.torproject.org collector-02.torproject.org crm-int-01.torproject.org dal-rescue-01.torproject.org dal-rescue-02.torproject.org 
          dangerzone-01.torproject.org donate-01.torproj
23:29:36 -ALERTOR1:#tor-alerts- donate-review-01.torproject.org forum-01.torproject.org fsn-node-01.torproject.org fsn-node-02.torproject.org 
          fsn-node-03.torproject.org fsn-node-04.torproject.org fsn-node-05.torproject.org fsn-node-06.torproject.org fsn-node-07.torproject.org 
          fsn-node-08.torproject.org gayi.torproject.org gitlab-02.torproject.org henryi.torproject.org hetzner-hel1-02.torproject.org 
          hetzner-hel1-03.torproject.org hetzner-nbg1-01.torproject.org ...
23:29:42 -ALERTOR1:#tor-alerts- hetzner-nbg1-02.torproject.org loghost01.torproject.org mandos-01.torproject.org materculae.torproject.org 
          media-01.torproject.org meronense.torproject.org metrics-store-01.torproject.org neriniflorum.torproject.org nevii.torproject.org 
          ns3.torproject.org ns5.torproject.org onionbalance-02.torproject.org onionoo-backend-01.torproject.org onionoo-backend-02.torproject.org 
          onionoo-backend-03.torproject.org onionoo-frontend-01.torproject.
23:29:48 -ALERTOR1:#tor-alerts- onionoo-frontend-02.torproject.org palmeri.torproject.org pauli.torproject.org polyanthum.torproject.org 
          probetelemetry-01.torproject.org puppetdb-01.torproject.org rdsys-frontend-01.torproject.org rdsys-test-01.torproject.org relay-01.torproject.org 
          rude.torproject.org ssh-dal-01.torproject.org static-master-fsn.torproject.org staticiforme.torproject.org submit-01.torproject.org 
          survey-01.torproject.org tb-build-02.torproject.org ...
23:29:53 -ALERTOR1:#tor-alerts- tb-build-03.torproject.org tb-pkgstage-01.torproject.org tb-tester-01.torproject.org tbb-nightlies-master.torproject.org 
          telegram-bot-01.torproject.org vault-01.torproject.org weather-01.torproject.org web-dal-07.torproject.org web-dal-08.torproject.org 
          web-fsn-01.torproject.org web-fsn-02.torproject.org

it's clearly not ideal: the output is a little garbled, and still very verbose. but at least it's six lines, and not eighty, every time a host recovers.

The downside is that we don't see which host recovers, just that the counter goes down.

Also, unfortunately, it's not clear at all the counter is correct: it's counting the number of elements in the "alerts" list, but this can actually include resolved alerts! so it's not very accurate. and unfortunately, we're very limited in what we can do inside those golang templates.

for example, to limit the number of hosts shown, i have tried the range $i, $v := .Alerts pattern, but the irc relay fails to start with "unexpected ," or something, really strange.

we'd need helper functions in the format.go code in the relay or, ideally, a way to only send new notifications if they're resolved, or weren't sent already. because the relay doesn't keep state, it probably can't do the latter.

i have found there's a PR to add a different template for resolved notifications, in https://github.com/google/alertmanager-irc-relay/pull/21

it doesn't help us much, but shows the formatter knows about that status and a bit of the code structure that could allow us to do what we need. it requires a patch however.

another avenue would be to look at matrix notifications (#40216). in that issue, there are many different Matrix relays mentioned, and i added more, but it's not clear to me any of those would address our issues here.

while looking for a matrix bot i have, of course, found another IRC bot as well:

https://github.com/multimfi/bot

i also found a cool trick: https://github.com/moan0s/alertbot/tree/main/alert_examples has a bunch of alerts that can be dumped on a webhook endpoint to simulate an alert notification from alertmanager, with:

curl --header "Content-Type: application/json" \
  --request POST \
  --data "@alert_examples/prometheus.json" \
  https://webhook.example.com/_matrix/maubot/plugin/maubot/webhook/!zOcbWjsWzdREnihreC:example.com

that's something we could use to test our setup, with various payloads. i've been running tcpdump -A -n -i lo port 8099 to look at payloads, but we need something more solid to extract the actual payloads more clearly.

perhaps we could have another webhook endpoint that just logs the payloads, and look a here, tomtom-international/alertmanager-webhook-logger does exactly that! looks pretty trivial too. only one dependent module is missing from debian.

Debian has the webhook thing packaged that we could also use for logging.

i actually went ahead and wrote a trivial logger for now, it's like 20 lines of python. i haven't yet set it up as a service because i'm not sure we want this in the long term. for now it's just logging in a screen(1).

assigned to @anarcat and unassigned @lelutin

i think i have improved on things quite a bit here.

here's the last notification:

00:29:36 -ALERTOR1:#tor-alerts- warning: OutdatedLibraries for node is firing 8 alerts on fsn-node-01.torproject.org fsn-node-02.torproject.org 
          fsn-node-03.torproject.org fsn-node-04.torproject.org fsn-node-05.torproject.org fsn-node-06.torproject.org fsn-node-07.torproject.org 
          fsn-node-08.torproject.org 
00:36:21 -ALERTOR1:#tor-alerts- warning: DRBDDegraded for node is firing 1 alerts on fsn-node-01.torproject.org 
00:41:21 -ALERTOR1:#tor-alerts- warning: DRBDDegraded for node is resolved 
01:14:37 -ALERTOR1:#tor-alerts- OutdatedLibraries[node] alert warning is firing 7 alerts on  fsn-node-01.torproject.org fsn-node-03.torproject.org 
          fsn-node-04.torproject.org fsn-node-05.torproject.org fsn-node-06.torproject.org fsn-node-07.torproject.org fsn-node-08.torproject.org

we don't see colors there, so it's not as great, here's a screenshot:

here you can see the fsn-node-07 server was just taken out of the list of affected servers. the count is not quite right, as technically there are now 6 alerts firing, but it's pretty close.

mentioned in commit wiki-replica@de66b91f

we'll see on next reboot, but i think this has improved tremendously with the switch to "one alert per group", so marking as needs review.

added Needs Review label and removed Backlog label

This is still an issue. Today we've had a new kernel update come through a DSA and we've had tons of warnings in the channel:

03:09:22 -ALERTOR1:#tor-alerts- NeedsReboot[node/warning] alert is firing, 1 alerts on dangerzone-01.torproject.org
03:14:22 -ALERTOR1:#tor-alerts- NeedsReboot[node/warning] alert is firing, 3 alerts on dangerzone-01.torproject.org henryi.torproject.org 
          tb-tester-01.torproject.org
03:19:22 -ALERTOR1:#tor-alerts- NeedsReboot[node/warning] alert is firing, 6 alerts on dangerzone-01.torproject.org henryi.torproject.org 
          meronense.torproject.org metricsdb-01.torproject.org tb-tester-01.torproject.org web-fsn-01.torproject.org
03:24:22 -ALERTOR1:#tor-alerts- NeedsReboot[node/warning] alert is firing, 9 alerts on bacula-director-01.torproject.org dangerzone-01.torproject.org 
          henryi.torproject.org media-01.torproject.org meronense.torproject.org metricsdb-01.torproject.org tb-tester-01.torproject.org 
          tbb-nightlies-master.torproject.org web-fsn-01.torproject.org
03:29:22 -ALERTOR1:#tor-alerts- NeedsReboot[node/warning] alert is firing, 12 alerts on bacula-director-01.torproject.org dangerzone-01.torproject.org 
          henryi.torproject.org idle-fsn-01.torproject.org media-01.torproject.org meronense.torproject.org metricsdb-01.torproject.org nevii.torproject.org 
          staticiforme.torproject.org tb-tester-01.torproject.org tbb-nightlies-master.torproject.org web-fsn-01.torproject.org
03:34:22 -ALERTOR1:#tor-alerts- NeedsReboot[node/warning] alert is firing, ...
03:34:22 -ALERTOR1:#tor-alerts- 18 alerts on alberti.torproject.org bacula-director-01.torproject.org check-01.torproject.org chives.torproject.org 
          dangerzone-01.torproject.org henryi.torproject.org idle-fsn-01.torproject.org media-01.torproject.org meronense.torproject.org 
          metricsdb-01.torproject.org neriniflorum.torproject.org nevii.torproject.org rude.torproject.org staticiforme.torproject.org 
          tb-tester-01.torproject.org tbb-nightlies-master.torproject.org ...
03:34:22 -ALERTOR1:#tor-alerts- web-dal-07.torproject.org web-fsn-01.torproject.org
03:34:40 -ALERTOR2:#tor-alerts- HTTPSResponseDelayExceeded[blackbox/warning] alert is firing, 1 alerts on http://gitlab.torproject.org
03:39:22 -ALERTOR1:#tor-alerts- NeedsReboot[node/warning] alert is firing, ...
03:39:22 -ALERTOR1:#tor-alerts- 23 alerts on alberti.torproject.org bacula-director-01.torproject.org check-01.torproject.org chives.torproject.org 
          ci-runner-x86-03.torproject.org dangerzone-01.torproject.org gayi.torproject.org henryi.torproject.org idle-fsn-01.torproject.org 
          media-01.torproject.org meronense.torproject.org metricsdb-01.torproject.org neriniflorum.torproject.org nevii.torproject.org 
          onionoo-backend-02.torproject.org rude.torproject.org ...
03:39:22 -ALERTOR1:#tor-alerts- static-master-fsn.torproject.org staticiforme.torproject.org tb-tester-01.torproject.org tbb-nightlies-master.torproject.org 
          vault-01.torproject.org web-dal-07.torproject.org web-fsn-01.torproject.org

... this kept going on for a while, we're now at:

20:09:22 -ALERTOR1:#tor-alerts- NeedsReboot[node/warning] alert is firing, ...
20:09:22 -ALERTOR1:#tor-alerts- 56 alerts on alberti.torproject.org anonticket-01.torproject.org bacula-director-01.torproject.org 
          btcpayserver-02.torproject.org bungei.torproject.org cdn-backend-sunet-02.torproject.org check-01.torproject.org chives.torproject.org 
          ci-runner-x86-02.torproject.org ci-runner-x86-03.torproject.org colchicifolium.torproject.org collector-02.torproject.org 
          crm-ext-01.torproject.org crm-int-01.torproject.org dal-rescue-01.torproject.org ..
20:09:28 -ALERTOR1:#tor-alerts- dal-rescue-02.torproject.org dangerzone-01.torproject.org donate-review-01.torproject.org gayi.torproject.org 
          gitlab-02.torproject.org henryi.torproject.org hetzner-hel1-03.torproject.org hetzner-nbg1-01.torproject.org idle-fsn-01.torproject.org 
          mandos-01.torproject.org materculae.torproject.org media-01.torproject.org meronense.torproject.org metricsdb-01.torproject.org 
          minio-01.torproject.org neriniflorum.torproject.org nevii.torproj
20:09:33 -ALERTOR1:#tor-alerts- ns5.torproject.org onionbalance-02.torproject.org onionoo-backend-02.torproject.org onionoo-frontend-01.torproject.org 
          onionoo-frontend-02.torproject.org palmeri.torproject.org polyanthum.torproject.org rdsys-frontend-01.torproject.org rdsys-test-01.torproject.org 
          relay-01.torproject.org rude.torproject.org static-master-fsn.torproject.org staticiforme.torproject.org tb-build-03.torproject.org 
          tb-build-06.torproject.org ...
20:09:38 -ALERTOR1:#tor-alerts- tb-pkgstage-01.torproject.org tb-tester-01.torproject.org tbb-nightlies-master.torproject.org telegram-bot-01.torproject.org 
          vault-01.torproject.org web-dal-07.torproject.org web-dal-08.torproject.org web-fsn-01.torproject.org web-fsn-02.torproject.org

... that is, 56 alerts, so it's actually probably not done yet.

So this just doesn't work at all, we really need to figure something out here.

added Backlog label and removed Needs Review label

During a somewhat long drive, that said, I had some ideas.

Our problem here is that we're progressively getting more and more alerts added to the group, and each of those (modulo the group_interval) trigger a notification, with more and more content into it. How do we fix this?

A few ideas, TL;DR:

info severity
group_interval: 24h route
scope=fleet
group_by: version_codename

Long version:

add a new value for the severity label in our alerts, in this case info which doesn't notify on IRC at all (credits to @lelutin for that idea)

This should probably be done in the short term in any case, because reboot runs are just too disruptive, but IMHO not a long term fix because we do want to see this on IRC. Otherwise it creates a dissonance because suddenly IRC stops being a full log of alerts, which is kind of nice to have right now because it's pretty much the only place we have this, short of the JSON dump.
add a new route specifically for this alert, with different group_interval settings.

we could, for example, have a 24h interval so that we wait a really long time before sending duplicate alerts like this.

the downside of that is that it would also take a loooong time for the alerts to be marked as resolved, but i think that's okay because we can always tap in the prom API directly to get the list of servers needing a reboot. that is, actually, how i've done the last reboot runs, by asking prom the list of servers needing a reboot... Not directly related, but it's similar to how host.all-pending-upgrades works, by probing Prom for a server list.
add a scope: fleet label for noisy alerts that likely affect all servers,

This is mainly cosmetic, to avoid routing the alert by name and duplicating the business logic out of prometheus-alerts.git. By having a scope label, we can specify in prometheus-alerts.git how an alert should be routed, instead of having to copy the alert name to the alertmanager config.

This also allows for other similar alerts to be processed correctly. OutdatedLibraries could have a similar scope.

Ideas of scopes:
- instance: current configuration, new default
- cluster: say, gnt-dal vs gnt-fsn, or web-mirrors or something, theoretical at this point
- fleet: applies to all machines, this is what the NeedsReboot alert would be labeled at, and used to route it properly as per step 2
the holy grail: only one notification when such an alert comes in, by grouping per OS version.

in this case, our alert would look like:
```
NeedsReboot[node/warning] alert is firing 56 machines running bookworm need a reboot
```
probably not possible to literally do this in the current alertmanager-irc-relay service (if at all), as we send only one alert per group, but we turned that setting on precisely because of this issue in the first place, so if we have proper scoping, perhaps this might just work.

we would need to have the OS version in the alert labels (so probably some join or something) and show it as alias (or, actually, if we stop doing the grouping, then our alerting templates work again, and we just use the right thing in the alert summary annotation

So, in other words, i think we have lots of options here.

I think the first step is probably, as you say, to route the info alerts to devnull (or, more precisely, route them only to the logger, so having one first route before the continue logger). This would also show us whether we can have multiple routes with the same receivers but with different parameters. Then we can try the group_by hack, and finally figure out a magic query that would join on the OS version.

holy crap that works

sum by (version_codename) (
  node_reboot_required * on(instance) 
  group_left(version_codename) node_os_info
)

this currently says:

Element	Value
`{version_codename="bookworm"}`	56
`{version_codename="bullseye"}`	0
`{version_codename="buster"}`	0

i must admit i couldn't figure this out on my own, i was stumbling upon the very obscure vector matching operators and only got as far as node_reboot_required group_left node_os_info, but that was pretty close no? anyways, the above is from GPT-4o, and i think it works! it doesn't have the "normal" label we expect for alert routing though, so we actually need this instead:

sum by (team,job,version_codename) (
  node_reboot_required * on(instance) 
  group_left(version_codename) node_os_info
)

this gives us:

Element	Value
{job="node",team="TPA",version_codename="bookworm"}	56
{job="node",team="TPA",version_codename="bullseye"}	0
{job="node",team="TPA",version_codename="buster"}	0

we can't send the alias or instance though, because then we'd go back to the grouping issue we had. and it would require not grouping alerts anymore...

so anyways, probably something to try next, actually. huge advantage is it doesn't require any extra severity or scope label: we "just" need to change the alerting rule, its template, and the alertmanager-irc-relay configuration.

i'll make a MR (prometheus-alerts!55 (merged)).

mentioned in commit prometheus-alerts@26d59a58

mentioned in commit prometheus-alerts@31824521

mentioned in merge request prometheus-alerts!55 (merged)

mentioned in commit prometheus-alerts@7d970d0c

okay, so @lelutin figured out how to deploy this with unit tests in prometheus-alerts!55 (merged) and has merged the proposed change.. so now i need to deploy this and tweak the irc relay to switch back to one message per alert instead of one per group...

@lelutin i think we'll also need to tweak the OutdatedLibraries alert though, do you think you could look into that one as well now?

added Doing label and removed Backlog label

marked this issue as related to #41811 (closed)

i deployed the change in puppet. there was a hunk i missed in the revert which broke the relay, as it was showing plain JSON flooding in the channel, but that has been fixed.

we're waiting to see how well this works now.

added Needs Review label and removed Doing label

mentioned in commit prometheus-alerts@2d5b3411

marked this issue as related to prometheus-alerts#21 (closed)

i think this works, i filed prometheus-alerts#21 (closed) because there are weird alerts with "zero hosts", but we fixed the flood.

closed

this is an issue again:

16:06:22 -ALERTOR1:#tor-alerts- UnexpectedReboot [firing] Host anonticket-01.torproject.org has recently unexpectedly rebooted
16:06:22 -ALERTOR1:#tor-alerts- UnexpectedReboot [firing] Host btcpayserver-02.torproject.org has recently unexpectedly rebooted
16:06:22 -ALERTOR1:#tor-alerts- UnexpectedReboot [firing] Host ci-runner-x86-02.torproject.org has recently unexpectedly rebooted
16:06:22 -ALERTOR1:#tor-alerts- UnexpectedReboot [firing] Host dal-node-01.torproject.org has recently unexpectedly rebooted
16:06:25 -ALERTOR1:#tor-alerts- UnexpectedReboot [firing] Host dangerzone-01.torproject.org has recently unexpectedly rebooted
16:06:27 -ALERTOR1:#tor-alerts- UnexpectedReboot [firing] Host donate-01.torproject.org has recently unexpectedly rebooted
16:06:30 -ALERTOR1:#tor-alerts- UnexpectedReboot [firing] Host fsn-node-01.torproject.org has recently unexpectedly rebooted
16:06:33 -ALERTOR1:#tor-alerts- UnexpectedReboot [firing] Host idle-dal-02.torproject.org has recently unexpectedly rebooted
16:06:36 -ALERTOR1:#tor-alerts- UnexpectedReboot [firing] Host minio-01.torproject.org has recently unexpectedly rebooted
16:06:39 -ALERTOR1:#tor-alerts- UnexpectedReboot [firing] Host ns5.torproject.org has recently unexpectedly rebooted
16:06:42 -ALERTOR1:#tor-alerts- UnexpectedReboot [firing] Host rdsys-test-01.torproject.org has recently unexpectedly rebooted
16:06:45 -ALERTOR1:#tor-alerts- UnexpectedReboot [firing] Host static-gitlab-shim.torproject.org has recently unexpectedly rebooted
[...]
16:22:38 -ALERTOR1:#tor-alerts- UnexpectedReboot [firing] Host web-dal-07.torproject.org has recently unexpectedly rebooted
16:22:41 -ALERTOR1:#tor-alerts- UnexpectedReboot [firing] Host web-dal-08.torproject.org has recently unexpectedly rebooted
16:22:44 -ALERTOR1:#tor-alerts- UnexpectedReboot [firing] Host web-fsn-01.torproject.org has recently unexpectedly rebooted
16:29:06 -ALERTOR1:#tor-alerts- HostDown [firing] Host ssh-dal-01.torproject.org is not responding CRITICAL!
[...]

reopened

added Next label and removed Needs Review label

assigned to @lelutin and unassigned @anarcat

I took a quick look during the reboot run and the silences were getting created, and I could view them in karma

INFO: adding silence from 2025-01-14T22:38:51+00:00 to 2025-01-14T22:44:21+00:00 (0:05:30), created by: tor, comment: silencing all alerts for reboot, matchers: alias=fsn-node-08.torproject.org
INFO: posted silence 54ea8c02-eb79-4108-8423-104fe86a0c92: https://alertmanager.torproject.org/#/silences/54ea8c02-eb79-4108-8423-104fe86a0c92

maybe the expiration delay is not long enough for the unexpectedreboot alert to clear up?

closed

prometheus reboot alerts are extremely noisy on IRC

Designs

Child items 0

Activity