
enhance incident response procedures

today we had an ... interesting situation with the puppet infrastructure. while we recovered pretty well, all things considered, it would be important to enhance our response to such situations so that they are less stressful and, why not, even more "fun", if i can be so daring.

some background reading:

some ideas:

  • have an issue template for incidents, available offline (so, in git, which requires a git repository here, but maybe it's finally time to merge the wiki repo here anyways)
  • run simulations/games
  • have post-mortem templates, for example the PagerDuty template
  • gitlab has some incident management primitives including aforementioned "incidents" (which are really just issues)...
  • ... but also integrations, which are especially interesting considering they include native Prometheus integration, although using that might require switching from nagios to prometheus (#29864 (closed))

anyways, the core idea here is:

  1. have incident roles (note-taker, driver, comms, etc.)
  2. have incident and post-mortem templates
  3. run games
Edited by anarcat