Prometheus inhibitions
Quote from TPA-RFC-33:
Alertmanager supports two different concepts for turning off notifications:
- silences: operator issued override that turns off notifications for a given amount of time
- inhibitions: configured override that turns off notifications for an alert if another alert is already firing
We will make sure we can silence alerts from the Karma dashboard, which should work out of the box. It should also be possible to silence alerts in the built-in Alertmanager web interface, although that might require some manual work to deploy correctly in the Debian package.
By default, silences have a time limit in Alertmanager. If that becomes a problem, we could deploy kthxbye to automatically extend silences.
The other system, inhibitions, needs configuration to be effective. Micah said it is worth spending at least some time configuring some basic inhibitions to keep major outages from flooding operators with alerts, for example turning off alerts on reboots and so on. There are also ways to write alerting rules that do not need inhibitions at all.
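As an illustration of an alerting rule that does not need an inhibition, the dependency can be written into the rule expression itself with PromQL's `unless` operator. This is only a sketch: the group name, alert name, job names (`apache`, `node`) and the assumption that both jobs share an `alias` label carrying the host name are hypothetical and would need to match the real scrape configuration.

```yaml
groups:
  - name: example-exporter-down        # hypothetical group name
    rules:
      - alert: ApacheExporterDown      # hypothetical alert name
        # Fire when the apache exporter target is down, unless the node
        # exporter target with the same "alias" label is down as well.
        # Assumption: both jobs expose a common "alias" label.
        expr: |
          up{job="apache"} == 0
          unless on (alias)
          up{job="node"} == 0
        for: 5m
        labels:
          severity: warning
```

The tradeoff is that the suppression logic lives in every rule that needs it, whereas an inhibition keeps the rules simple and centralizes the relationship in the Alertmanager configuration.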
Actual inhibitions we know we need:
- DiskFull vs DiskWillFillIn4Hours (#41736 (closed))
- HTTPRedirectToHTTPSUnreachable (critical) should inhibit ApacheDown (or vice-versa, warning, introduced in #41756 (closed))
  - There's currently no label with a common value between the alerts that we can use to relate them, so implementing this is not possible unless we modify labels and find a way to have a common value.
- HTTPRedirectToHTTPSUnreachable (critical) / HTTPSUnreachable (critical) / HTTPSResponseDelayExceeded (warning) probably have some overlap
- node job down vs all the other blackbox checks (if not all other checks on that alias?); depends on severity?
- other exporters vs node exporter, e.g. we warn about apache exporter being down even if we know node is down
- CiviCRM job and kill switch (see prometheus-alerts@73f62301)
  - Checked with anarcat: there doesn't seem to be a direct causal link between the two events. When we've seen the kill switch alert recently, the civicrm job was still marked as running.
- DRBDDegraded and uptimes: we get those whenever we do reboots
- PgArchiverAge and PgArchiverFailed
- OutdatedLibraries vs NeedsReboot (#41804 (closed))
- FullBackupTooOld should inhibit IncrementalBackupTooOld
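Two of the pairs listed above could be expressed in the Alertmanager configuration roughly as follows. This is a sketch, not the deployed configuration: the `alias` and `mountpoint` labels in `equal` are assumptions about which labels the alerts share, the direction of the DiskFull rule is assumed (the list only says "vs"), and the `source_matchers` / `target_matchers` syntax requires an Alertmanager version that supports matchers.

```yaml
inhibit_rules:
  # Stated in the list: FullBackupTooOld should inhibit IncrementalBackupTooOld.
  - source_matchers:
      - 'alertname="FullBackupTooOld"'
    target_matchers:
      - 'alertname="IncrementalBackupTooOld"'
    # Assumption: both alerts carry an "alias" label naming the same host.
    equal: ['alias']

  # Assumed direction: a disk that is already full mutes the prediction
  # that it will fill up.
  - source_matchers:
      - 'alertname="DiskFull"'
    target_matchers:
      - 'alertname="DiskWillFillIn4Hours"'
    # Assumption: both alerts share "alias" and "mountpoint" labels.
    equal: ['alias', 'mountpoint']
```

The `equal` list is what ties a source alert to a target alert on the same host or filesystem. This is why a pair such as HTTPRedirectToHTTPSUnreachable and ApacheDown cannot be related safely until the two alerts have a label with a common value.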