enhance incident response procedures
today we had an ... interesting situation with the puppet infrastructure. while we have actually recovered pretty well, all things considered, it would be important to improve our response to such situations so that they are less stressful and, why not, even more "fun", if i can be so daring.
some background reading:
- Got game? Secrets of great incident management
- PagerDuty incident response documentation
- Google SRE book advice
some ideas:
- have an issue template for incidents, available offline (so it has to live in git, which requires a git repository here, but maybe it's finally time to merge the wiki repo here anyways)
- run simulations/games
- have post-mortem templates, for example the PagerDuty template
- gitlab has some incident management primitives, including the aforementioned "incidents" (which are really just issues)...
- ... but also integrations, which are especially interesting considering they include a native Prometheus integration, although using it might require switching from nagios to prometheus (#29864 (closed))
anyways, the core idea here is:
- have incident roles (note-taker, driver, comms, etc)
- incident and post-mortem templates
- run games
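to make the roles and templates idea a bit more concrete, here is a rough sketch of what a tiny offline template generator could look like. everything in it is an assumption for the sake of illustration: the `Incident` class, the `ROLES` tuple (only the three roles named above, the "etc" is left open) and the section headings are hypothetical, not taken from the PagerDuty or gitlab templates.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# hypothetical role list, taken from the bullet above; extend as needed
ROLES = ("driver", "note-taker", "comms")


@dataclass
class Incident:
    """Illustrative incident record; not an existing tool or API."""

    title: str
    roles: dict = field(default_factory=dict)  # maps role name -> person
    opened: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def unassigned(self):
        """Return the roles nobody has picked up yet, in ROLES order."""
        return [role for role in ROLES if role not in self.roles]

    def render(self):
        """Render a plain-text issue body that can live in git and be used offline."""
        lines = [
            f"# Incident: {self.title}",
            "",
            f"Opened: {self.opened:%Y-%m-%d %H:%M} UTC",
            "",
            "## Roles",
            "",
        ]
        for role in ROLES:
            lines.append(f"- {role}: {self.roles.get(role, 'UNASSIGNED')}")
        # placeholder sections, made up for this sketch
        lines += [
            "",
            "## Timeline",
            "",
            "- ",
            "",
            "## Post-mortem",
            "",
            "- what happened:",
            "- what went well:",
            "- what to improve:",
        ]
        return "\n".join(lines)


if __name__ == "__main__":
    incident = Incident("puppet infrastructure outage", roles={"driver": "anarcat"})
    print(incident.render())
    print("still unassigned:", ", ".join(incident.unassigned()))
```

the point of keeping it this small is that the rendered text could be pasted into an issue or committed to a repository without depending on any service being up during the incident.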
Edited by anarcat