that's caused by alertmanager outputting everything again when the set of matching timeseries (hosts) changes: when one host resolves, all of the still-not-rebooted hosts are shown again.
we may want to check if something can be done at the alertmanager output level. otherwise, it's the alertmanager-irc-relay that we'll want to look at.
looking at group_wait in https://prometheus.io/docs/alerting/latest/configuration/ and i'm not sure it's the right thing: that tells the alertmanager to wait for other alerts before sending the notification. it's set to 30s by default.
group_interval, however, seems more appropriate: it tells the alertmanager to wait that amount of time before sending a new notification when an alert is added to a group, which seems to be exactly what we're going through here. that's set to 5m here.
that said, the above noise was much more tightly spaced than 5min or even 30s intervals: it's basically dumping (and re-dumping) all hosts requiring a reboot, all the time.
i think this is a group_by issue. right now we have this:
```yaml
group_by:
  - 'alertname'
  - 'cluster'
  - 'service'
```
which is a bit bizarre, because we don't have cluster or service labels at all. we have alertname (sure, that's built-in), but also team and severity defined in the rules, and a lot more in the metrics, like classes, instance, alias, job, and so on.
so perhaps we could group on classes, job or alias here?
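for the record, the kind of grouping i have in mind would look roughly like the snippet below; a minimal sketch, not our actual configuration, and the alias label is just one of the candidates mentioned above:

```yaml
# sketch only: group per alert name and per host instead of the
# (nonexistent) cluster/service labels; values are illustrative
route:
  receiver: irc
  group_by:
    - 'alertname'
    - 'alias'
  group_wait: 30s      # wait for other alerts before the first notification
  group_interval: 5m   # minimum delay before notifying about changes in a group
```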
i don't think the alertmanager-irc-relay webhook is at play here. it looks pretty dumb: it just takes messages and dumps them over IRC, after some formatting. it doesn't do deduplication, and the only configuration i saw was a buffer size on the irc-relay side and, on alertmanager's side, the number of alerts to dump there, which we don't customize (so that's "all alerts").
i think this is really in the alertmanager's config, possibly a grouping issue, thanks @georg for the pointer.
also note that another oddity with notifications was spotted in #18, but i doubt it's related. the fix could similarly be related to grouping, however.
okay, i turned up debugging on alertmanager and, from the needrestart stats on perdulce, i was able to trigger a new notification on irc. i think it triggered because the set of alerts in the alerting group changed, which is an issue on its own.
but, interestingly, this triggered only one webhook notification:
so at least one part of the problem here is that one webhook notification is triggering dozens of notifications on IRC. i suspect changing the way things are grouped in the route might help (group_by?)...
the other problem is that changes in the group trigger a new notification. that, i'm less sure how to fix.
one message that might be worth investigating is the repeat_interval stuff:
```
Sep 20 00:34:08 hetzner-nbg1-01 prometheus-alertmanager[882517]: ts=2024-09-20T00:34:08.105Z caller=main.go:498 level=warn component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=8760h0m0s retention=120h0m0s route={}
Sep 20 00:34:08 hetzner-nbg1-01 prometheus-alertmanager[882517]: ts=2024-09-20T00:34:08.105Z caller=main.go:498 level=warn component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=8760h0m0s retention=120h0m0s route="{}/{severity=\"critical\",team=\"TPA\"}"
Sep 20 00:34:08 hetzner-nbg1-01 prometheus-alertmanager[882517]: ts=2024-09-20T00:34:08.105Z caller=main.go:498 level=warn component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=8760h0m0s retention=120h0m0s route="{}/{severity=~\"critical|warning\",team=\"TPA\"}"
Sep 20 00:34:08 hetzner-nbg1-01 prometheus-alertmanager[882517]: ts=2024-09-20T00:34:08.105Z caller=main.go:498 level=warn component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=8760h0m0s retention=120h0m0s route="{}/{team=\"anti-censorship\"}"
Sep 20 00:34:08 hetzner-nbg1-01 prometheus-alertmanager[882517]: ts=2024-09-20T00:34:08.105Z caller=main.go:498 level=warn component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=8760h0m0s retention=120h0m0s route="{}/{team=\"network\"}"
Sep 20 00:34:08 hetzner-nbg1-01 prometheus-alertmanager[882517]: ts=2024-09-20T00:34:08.105Z caller=main.go:498 level=warn component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=8760h0m0s retention=120h0m0s route="{}/{team=~\"network-health|metrics\"}"
```
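if we ever want to make that warning go away, it should be enough to keep repeat_interval at or below the retention window (or raise alertmanager's --data.retention); a sketch, values illustrative:

```yaml
route:
  repeat_interval: 120h   # keep at or below alertmanager's --data.retention (default 120h)
```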
so i dived even deeper into this, even though this is not a priority (oops), because it's so damn noisy and we had just investigated so much already.
i tweaked the irc relay config to enable the msg_once_per_alert_group flag, which, instead of sending one message per alert, sends one message per group. i then modified the template to show all the affected hosts on one "line" (which the bot eventually splits up into multiple lines).
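for reference, the relevant bits of the relay configuration look roughly like this; a sketch only, where the server, port and template are illustrative and not the exact production values:

```yaml
# alertmanager-irc-relay configuration sketch (values illustrative)
http_host: localhost
http_port: 8099
irc_host: irc.example.net
irc_port: 6697
irc_nickname: ALERTOR1
irc_channels:
  - name: "#tor-alerts"
# send one message per alert group instead of one per alert
msg_once_per_alert_group: yes
# with the flag above, the template is rendered once for the whole group
msg_template: >-
  {{ .CommonLabels.severity }}: {{ .GroupLabels.alertname }} is {{ .Status }},
  {{ len .Alerts }} alerts on {{ range .Alerts }}{{ .Labels.alias }} {{ end }}
```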
the result is something like this:
```
21:04:36 -ALERTOR1:#tor-alerts- [warning:firing] Host alberti.torproject.org has processes using outdated libraries
21:04:36 -ALERTOR1:#tor-alerts- [warning:firing] Host anonticket-01.torproject.org has processes using outdated libraries
21:04:36 -ALERTOR1:#tor-alerts- [warning:firing] Host archive-01.torproject.org has processes using outdated libraries
21:04:36 -ALERTOR1:#tor-alerts- [warning:firing] Host backup-storage-01.torproject.org has processes using outdated libraries
21:04:39 -ALERTOR1:#tor-alerts- [warning:firing] Host bacula-director-01.torproject.org has processes using outdated libraries
21:04:42 -ALERTOR1:#tor-alerts- [warning:firing] Host btcpayserver-02.torproject.org has processes using outdated libraries
21:04:45 -ALERTOR1:#tor-alerts- [warning:firing] Host bungei.torproject.org has processes using outdated libraries
21:04:48 -ALERTOR1:#tor-alerts- [warning:firing] Host carinatum.torproject.org has processes using outdated libraries
21:04:51 -ALERTOR1:#tor-alerts- [warning:firing] Host cdn-backend-sunet-02.torproject.org has processes using outdated libraries
21:04:54 -ALERTOR1:#tor-alerts- [warning:firing] Host check-01.torproject.org has processes using outdated libraries
21:04:57 -ALERTOR1:#tor-alerts- [warning:firing] Host chives.torproject.org has processes using outdated libraries
21:05:00 -ALERTOR1:#tor-alerts- [warning:resolved] Host ci-runner-x86-02.torproject.org has processes using outdated libraries
21:05:03 -ALERTOR1:#tor-alerts- [warning:firing] Host ci-runner-x86-03.torproject.org has processes using outdated libraries
21:05:06 -ALERTOR1:#tor-alerts- [warning:firing] Host colchicifolium.torproject.org has processes using outdated libraries
21:05:09 -ALERTOR1:#tor-alerts- [warning:firing] Host collector-02.torproject.org has processes using outdated libraries
21:05:12 -ALERTOR1:#tor-alerts- [warning:firing] Host crm-ext-01.torproject.org has processes using outdated libraries
21:05:15 -ALERTOR1:#tor-alerts- [warning:firing] Host crm-int-01.torproject.org has processes using outdated libraries
21:05:17 -ALERTOR1:#tor-alerts- [warning:firing] Host dal-rescue-01.torproject.org has processes using outdated libraries
21:05:20 -ALERTOR1:#tor-alerts- [warning:firing] Host dal-rescue-02.torproject.org has processes using outdated libraries
21:05:23 -ALERTOR1:#tor-alerts- [warning:firing] Host dangerzone-01.torproject.org has processes using outdated libraries
21:05:26 -ALERTOR1:#tor-alerts- [warning:firing] Host donate-01.torproject.org has processes using outdated libraries
21:05:29 -ALERTOR1:#tor-alerts- [warning:firing] Host donate-review-01.torproject.org has processes using outdated libraries
21:05:32 -ALERTOR1:#tor-alerts- [warning:firing] Host forum-01.torproject.org has processes using outdated libraries
21:05:35 -ALERTOR1:#tor-alerts- [warning:firing] Host gitlab-02.torproject.org has processes using outdated libraries
21:05:38 -ALERTOR1:#tor-alerts- [warning:firing] Host henryi.torproject.org has processes using outdated libraries
21:05:41 -ALERTOR1:#tor-alerts- [warning:firing] Host hetzner-hel1-02.torproject.org has processes using outdated libraries
21:05:44 -ALERTOR1:#tor-alerts- [warning:firing] Host hetzner-hel1-03.torproject.org has processes using outdated libraries
21:05:47 -ALERTOR1:#tor-alerts- [warning:firing] Host hetzner-nbg1-01.torproject.org has processes using outdated libraries
21:05:50 -ALERTOR1:#tor-alerts- [warning:firing] Host hetzner-nbg1-02.torproject.org has processes using outdated libraries
21:05:53 -ALERTOR1:#tor-alerts- [warning:firing] Host loghost01.torproject.org has processes using outdated libraries
21:05:56 -ALERTOR1:#tor-alerts- [warning:firing] Host mandos-01.torproject.org has processes using outdated libraries
21:05:58 -ALERTOR1:#tor-alerts- [warning:firing] Host materculae.torproject.org has processes using outdated libraries
21:06:01 -ALERTOR1:#tor-alerts- [warning:firing] Host media-01.torproject.org has processes using outdated libraries
21:06:04 -ALERTOR1:#tor-alerts- [warning:firing] Host meronense.torproject.org has processes using outdated libraries
21:06:07 -ALERTOR1:#tor-alerts- [warning:resolved] Host metricsdb-01.torproject.org has processes using outdated libraries
21:06:10 -ALERTOR1:#tor-alerts- [warning:firing] Host neriniflorum.torproject.org has processes using outdated libraries
21:06:13 -ALERTOR1:#tor-alerts- [warning:firing] Host nevii.torproject.org has processes using outdated libraries
21:06:16 -ALERTOR1:#tor-alerts- [warning:firing] Host ns3.torproject.org has processes using outdated libraries
21:06:19 -ALERTOR1:#tor-alerts- [warning:firing] Host ns5.torproject.org has processes using outdated libraries
21:06:22 -ALERTOR1:#tor-alerts- [warning:firing] Host onionbalance-02.torproject.org has processes using outdated libraries
21:06:25 -ALERTOR1:#tor-alerts- [warning:firing] Host onionoo-backend-01.torproject.org has processes using outdated libraries
21:06:28 -ALERTOR1:#tor-alerts- [warning:firing] Host onionoo-backend-02.torproject.org has processes using outdated libraries
21:06:31 -ALERTOR1:#tor-alerts- [warning:firing] Host onionoo-backend-03.torproject.org has processes using outdated libraries
21:06:34 -ALERTOR1:#tor-alerts- [warning:firing] Host onionoo-frontend-01.torproject.org has processes using outdated libraries
21:06:37 -ALERTOR1:#tor-alerts- [warning:firing] Host onionoo-frontend-02.torproject.org has processes using outdated libraries
21:06:39 -ALERTOR1:#tor-alerts- [warning:firing] Host palmeri.torproject.org has processes using outdated libraries
21:06:45 -ALERTOR1:#tor-alerts- [warning:firing] Host polyanthum.torproject.org has processes using outdated libraries
21:06:47 -ALERTOR1:#tor-alerts- [warning:firing] Host probetelemetry-01.torproject.org has processes using outdated libraries
21:06:50 -ALERTOR1:#tor-alerts- [warning:firing] Host rdsys-frontend-01.torproject.org has processes using outdated libraries
21:06:53 -ALERTOR1:#tor-alerts- [warning:firing] Host rdsys-test-01.torproject.org has processes using outdated libraries
21:06:56 -ALERTOR1:#tor-alerts- [warning:firing] Host relay-01.torproject.org has processes using outdated libraries
21:06:59 -ALERTOR1:#tor-alerts- [warning:firing] Host ssh-dal-01.torproject.org has processes using outdated libraries
21:07:02 -ALERTOR1:#tor-alerts- [warning:firing] Host static-gitlab-shim.torproject.org has processes using outdated libraries
21:07:05 -ALERTOR1:#tor-alerts- [warning:firing] Host staticiforme.torproject.org has processes using outdated libraries
21:07:08 -ALERTOR1:#tor-alerts- [warning:firing] Host submit-01.torproject.org has processes using outdated libraries
21:07:11 -ALERTOR1:#tor-alerts- [warning:firing] Host survey-01.torproject.org has processes using outdated libraries
21:07:14 -ALERTOR1:#tor-alerts- [warning:firing] Host tb-build-02.torproject.org has processes using outdated libraries
21:07:17 -ALERTOR1:#tor-alerts- [warning:resolved] Host tb-build-06.torproject.org has processes using outdated libraries
21:07:20 -ALERTOR1:#tor-alerts- [warning:firing] Host tb-pkgstage-01.torproject.org has processes using outdated libraries
21:07:23 -ALERTOR1:#tor-alerts- [warning:firing] Host tb-tester-01.torproject.org has processes using outdated libraries
21:07:26 -ALERTOR1:#tor-alerts- [warning:firing] Host tbb-nightlies-master.torproject.org has processes using outdated libraries
21:07:29 -ALERTOR1:#tor-alerts- [warning:firing] Host vault-01.torproject.org has processes using outdated libraries
21:07:31 -ALERTOR1:#tor-alerts- [warning:firing] Host weather-01.torproject.org has processes using outdated libraries
21:07:34 -ALERTOR1:#tor-alerts- [warning:firing] Host web-dal-07.torproject.org has processes using outdated libraries
21:07:37 -ALERTOR1:#tor-alerts- [warning:firing] Host web-dal-08.torproject.org has processes using outdated libraries
21:07:42 -ALERTOR1:#tor-alerts- [warning:firing] Host web-fsn-01.torproject.org has processes using outdated libraries
21:07:45 -ALERTOR1:#tor-alerts- [warning:firing] Host web-fsn-02.torproject.org has processes using outdated libraries
```
it's clearly not ideal: the output is a little garbled, and still very verbose. but at least it's six lines, and not eighty, every time a host recovers.
The downside is that we don't see which host recovers, just that the counter goes down.
Also, it's not clear at all that the counter is correct: it counts the number of elements in the "alerts" list, but that list can actually include resolved alerts! so it's not very accurate. and unfortunately, we're very limited in what we can do inside those golang templates.
for example, to limit the number of hosts shown, i tried the range $i, $v := .Alerts pattern, but then the irc relay fails to start with an "unexpected ," error or something; really strange.
we'd need helper functions in the format.go code in the relay or, ideally, a way to only send new notifications if they're resolved or weren't sent already. because the relay doesn't keep state, it probably can't do the latter.
it doesn't help us much, but it shows the formatter knows about that status, and a bit of the code structure that could allow us to do what we need. it requires a patch, however.
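for what it's worth, alertmanager's own template.Data type exposes Firing and Resolved helpers on the alerts list; if the relay passes that type through to the template unchanged (i'm not sure it does, hence the possible need for a patch), something along these lines might already give an honest count. an untested sketch:

```yaml
# assumes .Alerts is alertmanager's template.Alerts type, which has
# Firing/Resolved methods; counts only the alerts that are still firing
msg_template: >-
  {{ .CommonLabels.severity }}: {{ .GroupLabels.alertname }} is {{ .Status }},
  {{ len .Alerts.Firing }} firing / {{ len .Alerts.Resolved }} resolved
```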
another avenue would be to look at matrix notifications (#40216). in that issue, there are many different Matrix relays mentioned, and i added more, but it's not clear to me any of those would address our issues here.
while looking for a matrix bot i have, of course, found another IRC bot as well:
that's something we could use to test our setup, with various payloads. i've been running tcpdump -A -n -i lo port 8099 to look at payloads, but we need something more solid to extract the actual payloads more clearly.
perhaps we could have another webhook endpoint that just logs the payloads and, look at that, tomtom-international/alertmanager-webhook-logger does exactly that! looks pretty trivial too. only one dependent module is missing from debian.
Debian has the webhook thing packaged, which we could also use for logging.
i actually went ahead and wrote a trivial logger for now; it's like 20 lines of python. i haven't yet set it up as a service because i'm not sure we want this in the long term. for now it's just logging in a screen(1).
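for the curious, it's something in this spirit (a from-memory sketch, not the exact script; the port is arbitrary):

```python
#!/usr/bin/env python3
"""Tiny alertmanager webhook receiver that just logs the JSON payloads."""
import json
import logging
from http.server import BaseHTTPRequestHandler, HTTPServer

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")


class WebhookLogger(BaseHTTPRequestHandler):
    def do_POST(self):
        # read the webhook body and dump it, pretty-printed, to the log
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length)
        try:
            logging.info(json.dumps(json.loads(payload), indent=2))
        except ValueError:
            logging.info("non-JSON payload: %r", payload)
        self.send_response(200)
        self.end_headers()


if __name__ == "__main__":
    # port 8098 is arbitrary; point the alertmanager webhook receiver at it
    HTTPServer(("127.0.0.1", 8098), WebhookLogger).serve_forever()
```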
i think i have improved on things quite a bit here.
here's the last notification:
```
00:29:36 -ALERTOR1:#tor-alerts- warning: OutdatedLibraries for node is firing 8 alerts on fsn-node-01.torproject.org fsn-node-02.torproject.org fsn-node-03.torproject.org fsn-node-04.torproject.org fsn-node-05.torproject.org fsn-node-06.torproject.org fsn-node-07.torproject.org fsn-node-08.torproject.org
00:36:21 -ALERTOR1:#tor-alerts- warning: DRBDDegraded for node is firing 1 alerts on fsn-node-01.torproject.org
00:41:21 -ALERTOR1:#tor-alerts- warning: DRBDDegraded for node is resolved
01:14:37 -ALERTOR1:#tor-alerts- OutdatedLibraries[node] alert warning is firing 7 alerts on fsn-node-01.torproject.org fsn-node-03.torproject.org fsn-node-04.torproject.org fsn-node-05.torproject.org fsn-node-06.torproject.org fsn-node-07.torproject.org fsn-node-08.torproject.org
```
we don't see colors there, so it's not as great; here's a screenshot:
here you can see the fsn-node-07 server was just taken out of the list of affected servers. the count is not quite right, as technically there are now 6 alerts firing, but it's pretty close.
That said, during a somewhat long drive, I had some ideas.
Our problem here is that we're progressively getting more and more alerts added to the group, and each of those (modulo the group_interval) triggers a notification, with more and more content in it. How do we fix this?
A few ideas, TL;DR:
- info severity
- group_interval: 24h route
- scope=fleet
- group_by: version_codename
Long version:
1. add a new value for the severity label in our alerts, in this case info, which doesn't notify on IRC at all (credits to @lelutin for that idea)
This should probably be done in the short term in any case, because reboot runs are just too disruptive, but IMHO it's not a long-term fix, because we do want to see this on IRC. Otherwise it creates a dissonance: suddenly IRC stops being a full log of alerts, which is kind of nice to have right now because it's pretty much the only place we have such a log, short of the JSON dump.
2. add a new route specifically for this alert, with different group_interval settings.
we could, for example, have a 24h interval so that we wait a really long time before sending duplicate alerts like this.
the downside of that is that it would also take a loooong time for the alerts to be marked as resolved, but i think that's okay, because we can always tap into the prom API directly to get the list of servers needing a reboot. that is, actually, how i've done the last reboot runs: by asking prom for the list of servers needing a reboot... Not directly related, but it's similar to how host.all-pending-upgrades works, by probing Prom for a server list.
3. add a scope: fleet label for noisy alerts that likely affect all servers.
This is mainly cosmetic, to avoid routing the alert by name and duplicating the business logic out of prometheus-alerts.git. By having a scope label, we can specify in prometheus-alerts.git how an alert should be routed, instead of having to copy the alert name to the alertmanager config.
This also allows for other similar alerts to be processed correctly. OutdatedLibraries could have a similar scope.
Ideas of scopes:
- instance: the current configuration, new default
- cluster: say, gnt-dal vs gnt-fsn, or web-mirrors or something; theoretical at this point
- fleet: applies to all machines; this is what the NeedsReboot alert would be labeled with, and used to route it properly as per step 2
4. the holy grail: only one notification when such an alert comes in, by grouping per OS version.
in this case, our alert would look like:
NeedsReboot[node/warning] alert is firing 56 machines running bookworm need a reboot
probably not possible to literally do this in the current alertmanager-irc-relay service (if at all), as we now send only one message per alert group, but we turned that setting on precisely because of this issue in the first place, so if we have proper scoping, perhaps this might just work.
we would need to have the OS version in the alert labels (so probably some join or something) and show it instead of the alias (or, actually, if we stop doing the grouping, then our alerting templates work again, and we just use the right thing in the alert summary annotation).
So, in other words, i think we have lots of options here.
I think the first step is probably, as you say, to route the info alerts to devnull (or, more precisely, route them only to the logger, so having one first route for them before the continue-ing logger route). This would also show us whether we can have multiple routes with the same receiver but different parameters. Then we can try the group_by hack, and finally figure out a magic query that would join on the OS version.
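Concretely, my reading of that first step (plus the scope=fleet idea from above) would be a routing tree along these lines; a sketch only, with made-up receiver names, and i haven't verified that alertmanager is happy with two routes pointing at the same receiver with different parameters:

```yaml
route:
  receiver: irc
  group_by: ['alertname', 'alias']
  routes:
    # 1. info alerts go to the logger only, and stop here (never reach IRC)
    - receiver: webhook-logger
      matchers:
        - severity = "info"
    # everything else gets logged too, then keeps being evaluated
    - receiver: webhook-logger
      continue: true
    # 2/3. fleet-wide alerts: one group per alert, at most one notification a day
    - receiver: irc
      matchers:
        - scope = "fleet"
      group_by: ['alertname']
      group_interval: 24h
```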
```
sum by (version_codename) ( node_reboot_required * on(instance) group_left(version_codename) node_os_info)
```
this currently says:
| Element | Value |
| --- | --- |
| {version_codename="bookworm"} | 56 |
| {version_codename="bullseye"} | 0 |
| {version_codename="buster"} | 0 |
i must admit i couldn't figure this out on my own: i was stumbling over the very obscure vector matching operators and only got as far as node_reboot_required group_left node_os_info, but that was pretty close, no? anyways, the above is from GPT-4o, and i think it works! it doesn't have the "normal" labels we expect for alert routing though, so we actually need this instead:
```
sum by (team,job,version_codename) ( node_reboot_required * on(instance) group_left(version_codename) node_os_info)
```
we can't send the alias or instance though, because then we'd go back to the grouping issue we had. and it would require not grouping alerts anymore...
so anyways, that's probably something to try next, actually. the huge advantage is that it doesn't require any extra severity or scope label: we "just" need to change the alerting rule, its template, and the alertmanager-irc-relay configuration.
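to make that concrete, the reworked rule could look something like the sketch below; the rule name, threshold and for duration are illustrative, not the change that was actually merged:

```yaml
groups:
  - name: reboot
    rules:
      - alert: NeedsReboot
        # one alert per Debian release instead of one per host
        expr: >-
          sum by (team, job, version_codename) (
            node_reboot_required * on(instance) group_left(version_codename) node_os_info
          ) > 0
        for: 24h
        labels:
          severity: warning
        annotations:
          summary: >-
            {{ $value }} machines running {{ $labels.version_codename }} need a reboot
```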
okay, so @lelutin figured out how to deploy this with unit tests in prometheus-alerts!55 (merged) and has merged the proposed change. so now i need to deploy this and tweak the irc relay to switch back to one message per alert instead of one per group...
@lelutin i think we'll also need to tweak the OutdatedLibraries alert though, do you think you could look into that one as well now?
i deployed the change in puppet. there was a hunk i missed in the revert which broke the relay (it was flooding the channel with plain JSON), but that has been fixed.
I took a quick look during the reboot run: the silences were getting created, and I could view them in karma.
```
INFO: adding silence from 2025-01-14T22:38:51+00:00 to 2025-01-14T22:44:21+00:00 (0:05:30), created by: tor, comment: silencing all alerts for reboot, matchers: alias=fsn-node-08.torproject.org
INFO: posted silence 54ea8c02-eb79-4108-8423-104fe86a0c92: https://alertmanager.torproject.org/#/silences/54ea8c02-eb79-4108-8423-104fe86a0c92
```
maybe the expiration delay is not long enough for the unexpectedreboot alert to clear up?