add runbooks for every alert in Prometheus

every alert in Prometheus should have a runbook, and we should enforce that in CI. the runbook (also called "pager playbook") should have clear instructions on how to deal with an alert, including, if necessary, things like adding capacity, debugging, restoring from backups, and so on.

note that we already have rules with TODO as a runbook.

instructions can assume technical knowledge, but need to be step-by-step enough that a tired and busy sysadmin can follow them without much risk of making a mistake.

so, checklist:

  • enforce a valid runbook URL in every alerting rule through CI
  • make a list of all alerts missing a runbook (possibly generated by the CI check itself) and add it to this checklist
  • write runbook for alert X (replace this item with one entry per alert that still needs a runbook)
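
a rough sketch of what the CI check could look like, to make the first two items concrete. it assumes the runbook lives in an annotation called `runbook_url` (that name is a guess, adjust it to whatever our rules actually use) and that the rule file has already been parsed from YAML into a dict. the alert names and URL in the example are made up.

```python
from urllib.parse import urlparse

# assumed annotation name, not confirmed against the real rules
RUNBOOK_KEY = "runbook_url"


def is_valid_runbook(value):
    """Return True if value looks like an http(s) URL, False for TODO or junk."""
    if not isinstance(value, str) or value.strip().upper() == "TODO":
        return False
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def alerts_missing_runbook(rule_file):
    """List the names of alerting rules without a valid runbook URL.

    rule_file is the parsed content of a Prometheus rule file,
    i.e. a dict with a top-level "groups" list.
    """
    missing = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rule, nothing to check
            annotations = rule.get("annotations") or {}
            if not is_valid_runbook(annotations.get(RUNBOOK_KEY)):
                missing.append(rule["alert"])
    return missing


# hypothetical example data, not taken from our real rule files
example = {
    "groups": [
        {
            "name": "example",
            "rules": [
                {
                    "alert": "DiskFull",
                    "expr": "up == 0",
                    "annotations": {RUNBOOK_KEY: "https://example.com/runbooks/disk-full"},
                },
                {
                    "alert": "TodoRunbook",
                    "expr": "up == 0",
                    "annotations": {RUNBOOK_KEY: "TODO"},
                },
                {"alert": "NoRunbook", "expr": "up == 0"},
                {"record": "job:up:sum", "expr": "sum(up) by (job)"},
            ],
        }
    ]
}

for name in alerts_missing_runbook(example):
    print(f"write runbook for alert {name}")
```

in CI, the job would fail whenever the returned list is not empty, and the printed lines could be pasted into this checklist as-is.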

/cc @lelutin

Edited by anarcat