Deploy Alertmanager and email notifications on prometheus1
Quote from TPA-RFC-33:
Alerting will be performed by [Alertmanager][], ideally in a high-availability cluster. Fully documenting Alertmanager is out of scope of this document, but a few glossary items seem worth defining here:
- alerting rules: rules defined, in PromQL, on the Prometheus server that fire if they are true (e.g. `node_reboot_required > 0` for a host requiring a reboot)
- alert: an alert sent following an alerting rule "firing" from a Prometheus server
- grouping: grouping multiple alerts together in a single notification
- inhibition: suppressing notification from an alert if another is already firing, configured in the Alertmanager configuration file
- silence: muting an alert for a specific amount of time, configured through the Alertmanager web interface
- high availability: support for receiving alerts from multiple Prometheus servers and avoiding duplicate notifications between multiple Alertmanager servers
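
Of these, inhibition is perhaps the least obvious. As a minimal sketch (assuming the `severity` labels introduced below, the standard `instance` label, and the `matchers` syntax of recent Alertmanager versions), an inhibition rule in the Alertmanager configuration file could look like this:

```yaml
# Sketch only: if a critical alert is firing for an instance, suppress
# notifications for warning-level alerts on that same instance.
inhibit_rules:
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal:
      - instance
```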
Configuration
Alertmanager configuration is trickier, as there is no "service discovery" option. It is made of two parts:
- alerting rules: PromQL queries that define error conditions that trigger an alert
- alerting routes: a map of label/value matches to notification receivers that defines who gets an alert for what (a routing sketch is included at the end of this section)
Technically, the alerting rules are actually defined inside the Prometheus server but, for sanity's sake, they are discussed here.
Those are currently managed solely through the [prometheus-alerts][] Git repository. TPA will start adding its own alerting rules through Puppet modules, but the GitLab repository will likely be kept for the foreseeable future, to keep things accessible to service admins.
The rules are currently stored in the `rules.d` folder in the Git repository. They should be namespaced by team name so that, for example, all TPA rules are prefixed with `tpa_`, to avoid conflicts.
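
As an illustration only, a namespaced rule file in `rules.d` could look something like the sketch below, which reuses the reboot example from the glossary above and the labels described in the next section; the file name, alert name, duration and annotation are made up for this example:

```yaml
# rules.d/tpa_reboot.yaml -- hypothetical example; the file name, alert name
# and duration are illustrative, not actual rules from prometheus-alerts.
groups:
  - name: tpa_reboot
    rules:
      - alert: tpa_NodeRebootRequired
        # fires when a host has reported a pending reboot for a full day
        expr: node_reboot_required > 0
        for: 24h
        labels:
          severity: warning
          team: TPA
        annotations:
          summary: "{{ $labels.instance }} requires a reboot"
```

Such a file can be checked locally with `promtool check rules rules.d/tpa_reboot.yaml` before pushing.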
Alert levels
The current noise levels in Icinga are unsustainable and make alert fatigue such a problem that we often miss critical issues until it's too late. And while Icinga operators (anarcat, in particular, has experience with this) have previously succeeded in reducing the amount of noise from Nagios, we feel a different approach is necessary here.
Each alerting rule MUST be tagged with at least the following labels:
- `severity`: how important the alert is
- `team`: which team(s) it belongs to

Here are the `severity` labels:
- `warning` (new): non-urgent condition, requiring investigation and fixing, but not immediately, with no user-visible impact; example: a server needs to be rebooted
- `critical`: serious condition with disruptive, user-visible impact which requires a prompt response; example: the donation site gives a 500 error

This distinction is partly inspired by Rob Ewaschuk's Philosophy on Alerting, which forms the basis of the "Monitoring Distributed Systems" chapter of Google's Site Reliability Engineering book.

Operators are strongly encouraged to drastically limit the number and frequency of `critical` alerts. If no `severity` label is provided, `warning` will be used.

The `team` labels should be something like:
- `anti-censorship`
- `metrics` (or `network-health`?)
- `TPA` (new)

If no `team` label is defined, CI should yield an error; there will NOT be a default fallback to TPA.
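
To make the routing side concrete, here is a rough sketch of what mapping `team` labels to email receivers could look like in the Alertmanager configuration; the receiver names and addresses are placeholders, the `matchers` syntax assumes a recent Alertmanager, and the global SMTP settings (`smtp_smarthost`, `smtp_from`, etc.) are assumed to be configured elsewhere:

```yaml
# Hypothetical routing sketch: dispatch alerts to per-team email receivers
# based on the team label. Names and addresses are placeholders.
route:
  group_by: ['alertname', 'team']   # group related alerts into one notification
  receiver: fallback                # required root receiver; alerts without a
                                    # team label should be rejected by CI long
                                    # before they reach this point
  routes:
    - matchers:
        - team = TPA
      receiver: tpa-email
    - matchers:
        - team = anti-censorship
      receiver: anti-censorship-email

receivers:
  - name: fallback
    email_configs:
      - to: 'root@example.org'
  - name: tpa-email
    email_configs:
      - to: 'tpa-alerts@example.org'
  - name: anti-censorship-email
    email_configs:
      - to: 'anti-censorship-alerts@example.org'
```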