Created Sep 28, 2021 by anarcat (@anarcat, Owner)

enhance incident response procedures

today we had an ... interesting situation with the puppet infrastructure. while we actually recovered pretty well, all things considered, it would be important to enhance our response to such situations so that they are less stressful and, why not, even more "fun", if i can be so daring.

some background reading:

  • Got game? Secrets of great incident management
  • pager duty incident response documentation

some ideas:

  • have an issue template for incidents (so, in git, which requires a git repository here, but maybe it's finally time to merge the wiki repo here anyways), available offline
  • run simulations/games
  • have post-mortem templates
  • gitlab has some incident management primitives including aforementioned "incidents" (which are really just issues)...
  • ... but also integrations, which are especially interesting considering they have native Prometheus integration, which might require switching from nagios to prometheus (#29864)

anyways, the core idea here is:

  1. have incident roles (note-taker, driver, comms, etc)
  2. incident and post-mortem templates
  3. run games
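
to make the first two points a bit more concrete, here is a minimal sketch of what generating an incident template could look like. it is only an illustration: the role names come from the list above, but the role descriptions, the section headings, and the `render_incident_template` function are all assumptions, not an agreed-upon format. the output is plain Markdown, so it could be committed to git (GitLab reads issue templates from `.gitlab/issue_templates/`) and still be usable offline.

```python
from datetime import date

# Role names are taken from the issue text; the one-line descriptions
# are assumptions added for illustration only.
ROLES = {
    "driver": "coordinates the response and makes decisions",
    "note-taker": "keeps a timeline of events and actions",
    "comms": "keeps users and the rest of the team informed",
}


def render_incident_template(title, day, roles=ROLES):
    """Return a Markdown incident template as a string (hypothetical layout)."""
    lines = [
        f"# Incident: {title}",
        "",
        f"Date: {day.isoformat()}",
        "",
        "## Roles",
        "",
    ]
    # One unchecked checkbox per role, to be filled in when the incident starts.
    for role, duty in roles.items():
        lines.append(f"- [ ] {role}: (unassigned), {duty}")
    lines += [
        "",
        "## Timeline",
        "",
        "- (time) (what happened)",
        "",
        "## Post-mortem",
        "",
        "- what went well?",
        "- what went wrong?",
        "- what should we change?",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_incident_template("puppet outage", date(2021, 9, 28)))
```

the same function could also be used to pre-fill a document for a simulation or game, since a drill needs the same roles and timeline as a real incident.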