draft: group reboot alerts by OS version instead of alias
Reboot alerts are extremely noisy right now. The first one fires as soon as a single node is noticed as requiring a reboot, and then more alerts keep piling up as the rest of the fleet upgrades its kernel packages and notices the pending reboot.
We don't really need to know which individual servers to reboot. I mean, we do want to operate on those servers, but what we actually want to know is:
- okay, this is the bookworm DSA I've just seen fly by, is the whole fleet ready for reboot yet?
- no? okay, let's wait for the upgrades to propagate. (possibly more alerts come in here as more servers come in, but perhaps not.)
- yes? okay wait, should we also wait for the bullseye DLA to come in so we just upgrade everything at once?
- wait for the second alert, for bullseye, to come in.
With this, in theory, we turn the massive reboot notification floods into one or two alerts, depending on how many Debian suites get their kernels updated, or how many architectures get their microcodes kicked.
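To make the effect concrete, here is a minimal sketch of what changing the grouping key does to the number of notifications. It is not the actual rule or Alertmanager configuration: the `alias` and `release` label names, the host names, and the `group_alerts` helper are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical NeedsReboot alerts. The "alias" and "release" label names
# and the host names are assumptions, not the labels actually in use.
alerts = [
    {"alertname": "NeedsReboot", "alias": "web-01.example.org", "release": "bookworm"},
    {"alertname": "NeedsReboot", "alias": "web-02.example.org", "release": "bookworm"},
    {"alertname": "NeedsReboot", "alias": "db-01.example.org", "release": "bookworm"},
    {"alertname": "NeedsReboot", "alias": "mail-01.example.org", "release": "bullseye"},
]


def group_alerts(alerts, label):
    """Bucket alerts by the value of one label, the way a grouping key would."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert[label]].append(alert["alias"])
    return dict(groups)


# Grouping by alias yields one group (and one notification) per host.
by_alias = group_alerts(alerts, "alias")
# Grouping by release yields one group per Debian suite.
by_release = group_alerts(alerts, "release")

print(len(by_alias), "notifications when grouped by alias")
print(len(by_release), "notifications when grouped by release")
for release, hosts in sorted(by_release.items()):
    print(f"NeedsReboot: {len(hosts)} {release} host(s) pending reboot")
```

With four hosts across two suites, grouping by alias produces four groups while grouping by release produces two, which is the "one or two alerts" outcome described above.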
This requires the alertmanager-irc-relay to be changed so that it starts sending individual alerts again, i.e. turning `msg_once_per_alert_group` back to `false`.
See team#41745 (closed).
Also note that this doesn't fix all the issues we're having here, because disabling the `msg_once_per_alert_group` setting will trigger alert floods for other fleet-wide things, like OutdatedLibraries. But that alert could also be similarly tweaked, once we're happy with the NeedsReboot alert.
In any case, this is a draft because we need to tweak the IRC relay config in conjunction with this deployment, and I really need a review from another brain before pushing this.