enforce severity and team labels, file namespaces
We want to have a more systematic approach to how alerts are labeled and managed here.
TPA-RFC-33 proposed attaching specific labels to each alert; we should make sure we have at least the following (see the example rule below):

- `severity`: `warning` or `critical`
- `team`: currently `TPA`, `anti-censorship`, `network-health` and `network`
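For reference, a minimal sketch of a rule carrying both labels; the alert name, expression and threshold are made up for illustration, only the `labels` block matters here:

```yaml
groups:
  - name: tpa_example
    rules:
      - alert: DiskWillFillSoon        # hypothetical alert, for illustration only
        expr: predict_linear(node_filesystem_avail_bytes[6h], 24 * 3600) < 0
        for: 1h
        labels:
          severity: warning            # warning or critical
          team: TPA                    # one of the known team labels
```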
The `team` labels were added by hand recently, and I think most rules carry a `severity` label as well, but we should validate both in CI.
I am also considering enforcing a `runbook` annotation so that each alert has an associated link to a "pager playbook", but I'm not sure that should be enforced. It should at least be given as an example; see the sketch below.
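Something like the following, extending the example rule above, with the link pointing at the relevant wiki page (the exact URL and wording are placeholders):

```yaml
      - alert: DiskWillFillSoon
        # ... expr and labels as above ...
        annotations:
          summary: "disk on {{ $labels.instance }} predicted to fill within 24h"
          runbook: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/prometheus
```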
The RFC also suggested namespacing the file names by team, but I'm not sure that's entirely necessary (or compatible with the current approach of namespacing by endpoint path). See also the discussion about exporter policy in team#41280.
Quote from TPA-RFC-33:
Configuration
Alertmanager configurations are trickier, as there is no "service discovery" option. Configuration is made of two parts:
- alerting rules: PromQL queries that define error conditions that trigger an alert
- alerting routes: a map of label/value matches to notification receiver that defines who gets an alert for what
Technically, the alerting rules are actually defined inside the Prometheus server but, for sanity's sake, they are discussed here.
Those are currently managed solely through the [prometheus-alerts][] Git repository. TPA will start adding its own alerting rules through Puppet modules, but the GitLab repository will likely be kept for the foreseeable future, to keep things accessible to service admins.
The rules are currently stored in the `rules.d` folder in the Git repository. They should be namespaced by team name so that, for example, all TPA rules are prefixed `tpa_`, to avoid conflicts.

Alert levels
The current noise levels in Icinga are unsustainable and make alert fatigue such a problem that we often miss critical issues until it's too late. And while Icinga operators (anarcat, in particular, has experience with this) have previously succeeded in reducing the amount of noise from Nagios, we feel a different approach is necessary here.
Each alerting rule MUST be tagged with at least these labels:

- `severity`: how important the alert is
- `team`: which team it belongs to

Here are the `severity` labels:

- `warning` (new): non-urgent condition, requiring investigation and fixing, but not immediately, with no user-visible impact; example: a server needs to be rebooted
- `critical`: serious condition with disruptive user-visible impact which requires a prompt response; example: the donation site gives a 500 error

This distinction is partly inspired by Rob Ewaschuk's Philosophy on Alerting, which forms the basis of the "Monitoring distributed systems" chapter of Google's Site Reliability Engineering book.
Operators are strongly encouraged to drastically limit the number and frequency of `critical` alerts. If no label is provided, `warning` will be used.

The `team` labels should be something like:

- `anti-censorship`
- `metrics` (or `network-health`?)
- `TPA` (new)

If no `team` label is defined, CI should yield an error; there will NOT be a default fallback to TPA.
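To make the "alerting routes" part quoted above more concrete, here is a rough sketch of what `team`-based routing could look like on the Alertmanager side; all receiver names are invented for illustration and do not reflect the actual configuration:

```yaml
# alertmanager.yml excerpt, hypothetical receivers
route:
  receiver: fallback                    # anything without a known team lands here
  routes:
    - matchers: ['team="TPA"', 'severity="warning"']
      receiver: tpa-dashboard-irc       # warnings stay on the dashboard/IRC
    - matchers: ['team="TPA"']
      receiver: tpa-email
    - matchers: ['team="anti-censorship"']
      receiver: anti-censorship-email
receivers:
  - name: fallback
  - name: tpa-dashboard-irc
  - name: tpa-email
  - name: anti-censorship-email
```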
Concretely, we have the following checklist here:

- enforce `severity` labels on alerting rules
- enforce `team` labels on alerting rules (or targets? how do we check that chain?)
- check routing: ensure that known teams don't route to the fallback and that, inversely, a non-existent team does fall back; can be done with `amtool` (see the CI sketch after this list), see https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/prometheus/#testing-alerts
- check TPA routing: ensure that warning routes only go to the dashboard/IRC, for example
- while we're here, add a check in CI to make sure we have a `runbook` annotation for each alert; we already have #16 filed for this
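A minimal sketch of how the routing checks could be wired into CI with `amtool config routes test`, assuming the Alertmanager configuration lives in an `alertmanager.yml` file in this repository; the job name, file path and receiver names are all assumptions:

```yaml
# hypothetical .gitlab-ci.yml job
check-routing:
  image:
    name: prom/alertmanager:latest
    entrypoint: [""]                    # bypass the alertmanager entrypoint so CI can run the script
  script:
    # a known team must NOT end up in the fallback receiver
    - amtool config routes test --config.file=alertmanager.yml --verify.receivers=tpa-dashboard-irc team=TPA severity=warning
    # an unknown team must fall back to the default receiver
    - amtool config routes test --config.file=alertmanager.yml --verify.receivers=fallback team=no-such-team severity=critical
```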
We already have some checks in `.gitlab-ci.yml`. Specifically, we check:

- that the targets configuration is valid (with `promtool check config prometheus-ci.yml`)
- that all passwords in that configuration are redacted
- basic linting with cloudflare/pint
A lot of the checklist above can probably be implemented with pint. In general, we should take a second look at pint, because I suspect we might be missing out on some of its goodies: it has a lot of checks that I didn't realize existed, which could be useful for us and may not already be enabled.
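For example, if I read the pint documentation correctly, the label enforcement could be expressed as `rule`/`label` checks in pint's configuration file; the file name and the exact team list below are assumptions to be confirmed:

```hcl
# .pint.hcl sketch
rule {
  match {
    kind = "alerting"
  }
  label "severity" {
    severity = "bug"                    # report a blocking problem when missing or invalid
    value    = "(warning|critical)"
    required = true
  }
  label "team" {
    severity = "bug"
    value    = "(TPA|anti-censorship|network-health|network)"
    required = true
  }
}
```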