enhance incident response procedures

today we had an ... interesting situation with the puppet infrastructure. while we have actually recovered pretty well, all things considered, it would be important to enhance our response to such situations so that they are less stressful and, why not, even more "fun", if i can be so daring.

some background reading:

some ideas:

have an issue template for incidents (so, in git, which requires a git repository here, but maybe it's finally time to merge the wiki repo here anyways), available offline
run simulations/games
have post-mortem templates, for example the PagerDuty template
gitlab has some incident management primitives including aforementioned "incidents" (which are really just issues)...
... but also integrations, which are especially interesting considering they have native Prometheus integration, which might require switching from Nagios to Prometheus (#29864 (closed))

anyways, the core idea here is:

have incident roles (note-taker, driver, comms, etc)
incident and post-mortem templates
run games
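
to make the first two points a bit more concrete, here's a rough sketch of how roles and an incident template could fit together. everything in it is an assumption for illustration: the `Incident` class, the `ROLES` tuple (just the three roles named above, the "etc" is left open) and the section headings are made up, and it does not use any GitLab API. it only renders plain text, so it works offline.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# hypothetical role list, taken from the roles named in this issue
ROLES = ("driver", "note-taker", "comms")


@dataclass
class Incident:
    title: str
    started: datetime
    roles: dict = field(default_factory=dict)  # role name -> person

    def unassigned(self):
        """Return the roles nobody has picked up yet, in ROLES order."""
        return [role for role in ROLES if role not in self.roles]

    def render(self):
        """Render an incident template as plain text (markdown-ish)."""
        lines = [
            f"# Incident: {self.title}",
            "",
            f"Started: {self.started.isoformat()}",
            "",
            "## Roles",
            "",
        ]
        for role in ROLES:
            lines.append(f"- {role}: {self.roles.get(role, 'UNASSIGNED')}")
        # the post-mortem headings below are placeholders, not an
        # established template
        lines += [
            "",
            "## Timeline",
            "",
            "## Post-mortem",
            "",
            "- what happened:",
            "- what went well:",
            "- what to improve:",
        ]
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    incident = Incident(
        "example outage",
        datetime.now(timezone.utc),
        {"driver": "alice"},
    )
    print(incident.render())
    print("still unassigned:", ", ".join(incident.unassigned()))
```

the same structure could also be used in a game: create a fake incident, hand out the roles, and fill in the template as if it were real.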

Edited May 28, 2024 by anarcat