add pager playbooks for every alert in Prometheus
every alert in Prometheus should have a playbook, and we should enforce that in CI. the playbook (also called "pager playbook") should have clear instructions on how to deal with an alert, including, if necessary, things like adding capacity, debugging, restoring from backups, and so on.
note that we already have rules with TODO as a playbook.
instructions can assume technical knowledge, but need to be step-by-step enough that a tired and busy sysadmin can follow them without much risk of making a mistake.
so, checklist:

- enforce a valid URL in alerting rules through CI
- when all playbook annotations exist for all rules (even non-TPA ones), remove the value "TODO" from valid values in pint's config -- we may need to move this point to its own task since it's possible that getting all teams to add a playbook might take a while.
  - anti-censorship has a plan that we can follow along: tpo/anti-censorship/team#140
- make a list of all alerts missing a playbook, add it to this checklist, possibly by first making the list through CI (list can be obtained with `git grep 'playbook: "TODO'`)
- write playbook for alert X (replace this item with a list of all alerts needing to be written)
  - in `tpa_node`
    - disk full (cf 0f9f0d4c)
    - howto/upgrades need a cleanup
  - in `tpa_blackbox`
    - X509CertNearExpired
    - SSHUnreachable
    - HTTP*
    - SMTPUnreachable
  - all non-TPA rules (delegated to service admins)
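
as a rough illustration of the first and third checklist items, here is a minimal sketch of a check that classifies the `playbook` annotation of each alerting rule as missing, TODO, invalid, or ok. it is not the pint configuration and not an existing script: the function names, the sample data, and the `example.org` URL are made up for illustration, and it assumes the rule file was already parsed into a dictionary (for example with a YAML library) following the usual `groups` / `rules` / `annotations` layout of Prometheus rule files.

```python
from urllib.parse import urlparse


def playbook_status(value):
    """Classify a playbook annotation as 'missing', 'todo', 'invalid' or 'ok'."""
    if value is None or not str(value).strip():
        return "missing"
    text = str(value).strip()
    # Same idea as `git grep 'playbook: "TODO'`: the value starts with TODO.
    if text.startswith("TODO"):
        return "todo"
    try:
        parsed = urlparse(text)
    except ValueError:
        return "invalid"
    # Assumption: a "valid URL" means an http(s) URL with a host part.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "ok"
    return "invalid"


def audit_rules(rule_file):
    """Yield (group, alert, status) for every alerting rule in a parsed rule file."""
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules have no playbook
            annotations = rule.get("annotations") or {}
            yield group.get("name"), rule["alert"], playbook_status(
                annotations.get("playbook")
            )


# Hypothetical sample data: the annotation values are invented for this sketch
# and do not reflect the current state of the real rules.
SAMPLE = {
    "groups": [
        {
            "name": "tpa_blackbox",
            "rules": [
                {
                    "alert": "X509CertNearExpired",
                    "annotations": {"playbook": "https://example.org/playbooks/x509"},
                },
                {"alert": "SSHUnreachable", "annotations": {"playbook": "TODO"}},
                {"alert": "SMTPUnreachable", "annotations": {}},
                {"record": "job:up:sum", "expr": "sum(up)"},
            ],
        }
    ]
}

if __name__ == "__main__":
    # Print every alert whose playbook still needs attention.
    for group, alert, status in audit_rules(SAMPLE):
        if status != "ok":
            print(f"{group}/{alert}: playbook {status}")
```

in CI, a wrapper could exit non-zero when any status is `missing` or `invalid`, and treat `todo` as a failure only once "TODO" is dropped from the accepted values.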
/cc @lelutin
Edited by anarcat