enforce severity and team labels, file namespaces
We want to have a more systematic approach to how alerts are labeled and managed here.
TPA-RFC-33 proposed attaching specific labels to each alert; we should make sure we have at least the following (see the example rule below):

- `severity`: `warning` or `critical`
- `team`: currently `TPA`, `anti-censorship`, `network-health` and `network`
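For reference, a minimal sketch of a rule carrying both labels; the alert name, expression and threshold are made up for illustration, only the `labels` block matters here:

```yaml
groups:
  - name: tpa_example
    rules:
      - alert: DiskWillFillSoon        # hypothetical alert, for illustration only
        expr: predict_linear(node_filesystem_avail_bytes[6h], 24 * 3600) < 0
        for: 1h
        labels:
          severity: warning            # warning or critical
          team: TPA                    # one of the known team labels
```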
The `team` labels were added by hand recently, and I think most rules carry a `severity` label as well, but we should validate both in CI.
I am also considering enforcing a `runbook` annotation so that each alert has an associated link to a "pager playbook", but I'm not sure that should be enforced. It should at least be given as an example; see the sketch below.
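Something like the following, extending the example rule above, with the link pointing at the relevant wiki page (the exact URL and wording are placeholders):

```yaml
      - alert: DiskWillFillSoon
        # ... expr and labels as above ...
        annotations:
          summary: "disk on {{ $labels.instance }} predicted to fill within 24h"
          runbook: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/prometheus
```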
The RFC also suggested namespacing the file names by team, but I'm not sure that's entirely necessary (or compatible with the current approach of namespacing by endpoint path). See also the discussion about exporter policy in team#41280.
Quote from TPA-RFC-33:
Configuration
Alertmanager configurations are trickier, as there is no "service discovery" option. Configuration is made of two parts:
- alerting rules: PromQL queries that define error conditions that trigger an alert
- alerting routes: a map of label/value matches to notification receiver that defines who gets an alert for what
Technically, the alerting rules are actually defined inside the Prometheus server but, for sanity's sake, they are discussed here.
Those are currently managed solely through the [prometheus-alerts][] Git repository. TPA will start adding its own alerting rules through Puppet modules, but the GitLab repository will likely be kept for the foreseeable future, to keep things accessible to service admins.
The rules are currently stored in the `rules.d` folder in the Git repository. They should be namespaced by team name so that, for example, all TPA rules are prefixed `tpa_`, to avoid conflicts.

Alert levels
The current noise levels in Icinga are unsustainable and make alert fatigue such a problem that we often miss critical issues until it's too late. And while Icinga operators (anarcat, in particular, has experience with this) have previously succeeded in reducing the amount of noise from Nagios, we feel a different approach is necessary here.
Each alerting rule MUST be tagged with at least these labels:

- `severity`: how important the alert is
- `team`: which team it belongs to

Here are the `severity` labels:

- `warning` (new): non-urgent condition, requiring investigation and fixing, but not immediately, with no user-visible impact; example: a server needs to be rebooted
- `critical`: serious condition with disruptive user-visible impact which requires a prompt response; example: the donation site gives a 500 error

This distinction is partly inspired by Rob Ewaschuk's Philosophy on Alerting, which forms the basis of the "Monitoring distributed systems" chapter of Google's Site Reliability Engineering book.
Operators are strongly encouraged to drastically limit the number and frequency of `critical` alerts. If no label is provided, `warning` will be used.

The `team` labels should be something like:

- `anti-censorship`
- `metrics` (or `network-health`?)
- `TPA` (new)

If no `team` label is defined, CI should yield an error; there will NOT be a default fallback to TPA.
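To make the "alerting routes" part quoted above more concrete, here is a rough sketch of what `team`-based routing could look like on the Alertmanager side; all receiver names are invented for illustration and do not reflect the actual configuration:

```yaml
# alertmanager.yml excerpt, hypothetical receivers
route:
  receiver: fallback                    # anything without a known team lands here
  routes:
    - matchers: ['team="TPA"', 'severity="warning"']
      receiver: tpa-dashboard-irc       # warnings stay on the dashboard/IRC
    - matchers: ['team="TPA"']
      receiver: tpa-email
    - matchers: ['team="anti-censorship"']
      receiver: anti-censorship-email
receivers:
  - name: fallback
  - name: tpa-dashboard-irc
  - name: tpa-email
  - name: anti-censorship-email
```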
Concretely, we have the following checklist here:

- enforce `severity` labels on alerting rules
- enforce `team` labels on alerting rules (or targets? how do we check that chain?)
- check routing: ensure that known teams don't route to the fallback and that, inversely, a non-existent team does fall back; can be done with `amtool` (see the CI sketch after this list), see https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/prometheus/#testing-alerts
- check TPA routing: ensure that warning routes only go to the dashboard/IRC, for example
- while we're here, add a check in CI to make sure we have a `runbook` annotation for each alert; we already have #16 filed for this
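A minimal sketch of how the routing checks could be wired into CI with `amtool config routes test`, assuming the Alertmanager configuration lives in an `alertmanager.yml` file in this repository; the job name, file path and receiver names are all assumptions:

```yaml
# hypothetical .gitlab-ci.yml job
check-routing:
  image:
    name: prom/alertmanager:latest
    entrypoint: [""]                    # bypass the alertmanager entrypoint so CI can run the script
  script:
    # a known team must NOT end up in the fallback receiver
    - amtool config routes test --config.file=alertmanager.yml --verify.receivers=tpa-dashboard-irc team=TPA severity=warning
    # an unknown team must fall back to the default receiver
    - amtool config routes test --config.file=alertmanager.yml --verify.receivers=fallback team=no-such-team severity=critical
```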
We already have some checks in `.gitlab-ci.yml`. Specifically, we check:

- that the targets configuration is valid (with `promtool check config prometheus-ci.yml`)
- that all passwords in that configuration are redacted
- basic linting with cloudflare/pint
A lot of the checklist above can probably be implemented with pint. In general, we should take a second look at pint, because I suspect we might be missing out on some of its goodies: it has a lot of checks that I didn't realize existed, which could be useful for us and may not already be enabled.
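For example, if I read the pint documentation correctly, the label enforcement could be expressed as `rule`/`label` checks in pint's configuration file; the file name and the exact team list below are assumptions to be confirmed:

```hcl
# .pint.hcl sketch
rule {
  match {
    kind = "alerting"
  }
  label "severity" {
    severity = "bug"                    # report a blocking problem when missing or invalid
    value    = "(warning|critical)"
    required = true
  }
  label "team" {
    severity = "bug"
    value    = "(TPA|anti-censorship|network-health|network)"
    required = true
  }
}
```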