Deploy Alertmanager and email notifications on prometheus1
Quote from TPA-RFC-33:
Alerting will be performed by [Alertmanager][], ideally in a high-availability cluster. Fully documenting Alertmanager is out of scope of this document, but a few glossary items seem worth defining here:
- alerting rules: rules defined, in PromQL, on the Prometheus server that fire if they are true (e.g. `node_reboot_required > 0` for a host requiring a reboot)
- alert: an alert sent following an alerting rule "firing" from a Prometheus server
- grouping: grouping multiple alerts together in a single notification
- inhibition: suppressing notification from an alert if another is already firing, configured in the Alertmanager configuration file
- silence: muting an alert for a specific amount of time, configured through the Alertmanager web interface
- high availability: support for receiving alerts from multiple Prometheus servers and avoiding duplicate notifications between multiple Alertmanager servers
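
Of these, inhibition is perhaps the least obvious. As a minimal sketch (assuming the `severity` labels introduced below, the standard `instance` label, and the `matchers` syntax of recent Alertmanager versions), an inhibition rule in the Alertmanager configuration file could look like this:

```yaml
# Sketch only: if a critical alert is firing for an instance, suppress
# notifications for warning-level alerts on that same instance.
inhibit_rules:
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal:
      - instance
```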
Configuration
Alertmanager configuration is trickier, as there is no "service discovery" option. It is made of two parts:
- alerting rules: PromQL queries that define error conditions that trigger an alert
- alerting routes: a map of label/value matches to notification receivers that defines who gets an alert for what (a routing sketch is included at the end of this section)
Technically, the alerting rules are actually defined inside the Prometheus server but, for sanity's sake, they are discussed here.
Those are currently managed solely through the [prometheus-alerts][] Git repository. TPA will start adding its own alerting rules through Puppet modules, but the GitLab repository will likely be kept for the foreseeable future, to keep things accessible to service admins.
The rules are currently stored in the `rules.d` folder in the Git repository. They should be namespaced by team name so that, for example, all TPA rules are prefixed with `tpa_`, to avoid conflicts.
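
As an illustration only, a namespaced rule file in `rules.d` could look something like the sketch below, which reuses the reboot example from the glossary above and the labels described in the next section; the file name, alert name, duration and annotation are made up for this example:

```yaml
# rules.d/tpa_reboot.yaml -- hypothetical example; the file name, alert name
# and duration are illustrative, not actual rules from prometheus-alerts.
groups:
  - name: tpa_reboot
    rules:
      - alert: tpa_NodeRebootRequired
        # fires when a host has reported a pending reboot for a full day
        expr: node_reboot_required > 0
        for: 24h
        labels:
          severity: warning
          team: TPA
        annotations:
          summary: "{{ $labels.instance }} requires a reboot"
```

Such a file can be checked locally with `promtool check rules rules.d/tpa_reboot.yaml` before pushing.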
Alert levels
The current noise levels in Icinga are unsustainable and make alert fatigue such a problem that we often miss critical issues until it's too late. And while Icinga operators (anarcat, in particular, has experience with this) have previously succeeded in reducing the amount of noise from Nagios, we feel a different approach is necessary here.
Each alerting rule MUST be tagged with at least the following labels:
- `severity`: how important the alert is
- `team`: which team(s) it belongs to

Here are the `severity` labels:
- `warning` (new): non-urgent condition, requiring investigation and fixing, but not immediately, with no user-visible impact; example: a server needs to be rebooted
- `critical`: serious condition with disruptive, user-visible impact which requires a prompt response; example: the donation site gives a 500 error

This distinction is partly inspired by Rob Ewaschuk's Philosophy on Alerting, which forms the basis of the "Monitoring Distributed Systems" chapter of Google's Site Reliability Engineering book.

Operators are strongly encouraged to drastically limit the number and frequency of `critical` alerts. If no `severity` label is provided, `warning` will be used.

The `team` labels should be something like:
- `anti-censorship`
- `metrics` (or `network-health`?)
- `TPA` (new)

If no `team` label is defined, CI should yield an error; there will NOT be a default fallback to TPA.
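
To make the routing side concrete, here is a rough sketch of what mapping `team` labels to email receivers could look like in the Alertmanager configuration; the receiver names and addresses are placeholders, the `matchers` syntax assumes a recent Alertmanager, and the global SMTP settings (`smtp_smarthost`, `smtp_from`, etc.) are assumed to be configured elsewhere:

```yaml
# Hypothetical routing sketch: dispatch alerts to per-team email receivers
# based on the team label. Names and addresses are placeholders.
route:
  group_by: ['alertname', 'team']   # group related alerts into one notification
  receiver: fallback                # required root receiver; alerts without a
                                    # team label should be rejected by CI long
                                    # before they reach this point
  routes:
    - matchers:
        - team = TPA
      receiver: tpa-email
    - matchers:
        - team = anti-censorship
      receiver: anti-censorship-email

receivers:
  - name: fallback
    email_configs:
      - to: 'root@example.org'
  - name: tpa-email
    email_configs:
      - to: 'tpa-alerts@example.org'
  - name: anti-censorship-email
    email_configs:
      - to: 'anti-censorship-alerts@example.org'
```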