draft: group reboot alerts by OS version instead of alias

anarcat requested to merge reboot-groups into main

Reboot alerts are extremely noisy right now. The first one fires when a first node is noticed as requiring a reboot, but then more alerts keep getting added as the rest of the fleet updates its kernel packages and notices the pending reboot.

We don't really need to know which individual servers to reboot. I mean, we do want to operate on those servers, but what we actually want to know is:

  1. okay, this is the bookworm DSA i've just seen fly by, is the whole fleet ready for reboot yet?

  2. no? okay, let's wait for the upgrades to propagate. (possibly more alerts come in here as more servers come in, but perhaps not.)

  3. yes? okay wait, should we also wait for the bullseye DLA to come in so we just upgrade everything at once?

  4. wait for the second alert, for bullseye, to come in.

With this, in theory, we turn the massive reboot notification floods into one or two alerts, depending on how many Debian suites get their kernels updated, or how many architectures get their microcodes kicked.
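To make the expected reduction concrete, here is a minimal sketch of the grouping idea in plain Python. It is not the actual rule or Alertmanager config from this branch: the `alias` and `codename` label names, the host names, and the `group_alerts` helper are all assumptions made up for illustration.

```python
from collections import defaultdict


def group_alerts(alerts, label):
    """Bucket alerts by the value of one label (hypothetical helper).

    Alerts that lack the label all land in the "" bucket.
    """
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert["labels"].get(label, "")].append(alert)
    return dict(groups)


# Made-up sample data: label names and hosts are assumptions, not real fleet data.
alerts = [
    {"labels": {"alertname": "NeedsReboot", "alias": "web-01", "codename": "bookworm"}},
    {"labels": {"alertname": "NeedsReboot", "alias": "web-02", "codename": "bookworm"}},
    {"labels": {"alertname": "NeedsReboot", "alias": "db-01", "codename": "bookworm"}},
    {"labels": {"alertname": "NeedsReboot", "alias": "mail-01", "codename": "bullseye"}},
]

by_host = group_alerts(alerts, "alias")      # one group (one notification) per server
by_suite = group_alerts(alerts, "codename")  # one group per Debian suite

print(f"grouped by alias: {len(by_host)} notifications")
print(f"grouped by suite: {len(by_suite)} notifications")
```

With this sample, grouping by host gives one notification per server, while grouping by suite gives one for bookworm and one for bullseye, however many servers are in each.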

This requires the alertmanager-irc-relay to be changed so that it starts sending individual alerts again, i.e. turning msg_once_per_alert_group back to false.

See team#41745 (closed).

Also note that this doesn't fix all issues we're having here, because disabling that msg_once_per_alert_group setting will trigger alert floods for other fleet-wide things, like OutdatedLibraries. But that alert could also be similarly tweaked, once we're happy with the NeedsReboot alert.

In any case, this is a draft because we need to tweak the IRC relay config in conjunction with this deployment, and I really need a review from another brain before pushing this.
