
[[!meta title="Incident and emergency response: what to do in case of fire"]]

This documentation is for sysadmins to figure out what to do when
things go wrong. If you don't have the required access or haven't
been trained for such situations, you might be better off just trying
to wake up someone who can deal with them. See the
[[doc/how-to-get-help]] documentation instead.

[[!toc levels=3]]

Specific situations
===================

Server down
-----------

If a server is non-responsive, you can first check if it is actually
reachable over the network:

    ping -c 10 server.torproject.org

If it does respond, you can try to diagnose the issue by looking at
[Nagios][] and/or [Grafana](https://grafana.torproject.org) to analyse what exactly is going on.

[Nagios]: https://nagios.torproject.org

If it does *not* respond, you should see if it's a virtual machine
and, if so, which server is hosting it. This information is
available in [[ldap]]
(or [the web interface](https://db.torproject.org/machines.cgi), under the
`physicalHost` field). Then log in to that server to diagnose the
issue.
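
As a rough sketch of the LDAP lookup, the function below pulls the
`physicalHost` value out of LDIF text, such as the output of
`ldapsearch`. Only the `physicalHost` field name comes from this page:
the server URI, base DN and `hostname` filter in the commented usage
are assumptions, so check [[ldap]] for the real values before relying
on them.

```shell
# physical_host_from_ldif: read LDIF on stdin and print the value of
# the first physicalHost attribute. Prints nothing if the attribute is
# absent, which suggests the machine is itself a physical host.
physical_host_from_ldif() {
    awk -F': ' 'tolower($1) == "physicalhost" { print $2; exit }'
}

# Hypothetical usage; the URI, base DN and filter are assumptions:
#   ldapsearch -x -LLL -H ldap://db.torproject.org \
#       -b "ou=hosts,dc=torproject,dc=org" \
#       "(hostname=server.torproject.org)" physicalHost \
#     | physical_host_from_ldif
```

The parser is deliberately naive: it ignores base64-encoded values
and wrapped LDIF lines, which should not matter for a plain hostname.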

If the physical host is not responding, or if the `physicalHost`
field is empty (in which case the server *is* a physical host), you
need to file a ticket with the upstream provider. To find out which
provider that is, look in [Nagios][]:

 1. search for the server name in the search box
 2. click on the server
 3. drill down the "Parents" until you find something that resembles
    a hosting provider (e.g. `hetzner-hel1-01` is Hetzner, `gw-cymru`
    is Cymru, `gw-scw-*` are at Scaleway, `gw-sunet` is Sunet)
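
The naming convention in step 3 can be sketched as a small lookup
function. The `gw-cymru`, `gw-scw-*` and `gw-sunet` patterns come
straight from the list above; the `hetzner-*` pattern is an assumption
generalised from the single `hetzner-hel1-01` example, and the
function name is made up for illustration.

```shell
# provider_for_parent: map a Nagios parent name to a hosting provider.
# Returns non-zero when the name matches none of the known patterns.
provider_for_parent() {
    case "$1" in
        hetzner-*) echo "Hetzner" ;;    # assumption: generalised from hetzner-hel1-01
        gw-cymru)  echo "Cymru" ;;
        gw-scw-*)  echo "Scaleway" ;;
        gw-sunet)  echo "Sunet" ;;
        *)         echo "unknown"; return 1 ;;
    esac
}
```

For example, `provider_for_parent hetzner-hel1-01` prints `Hetzner`.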

What follows are per-provider instructions:

### Hetzner robot (physical servers)

 1. Visit the [Hetzner Robot server page](https://robot.your-server.de/server) (password in
    `tor-passwords/hosts-extra-info`)
 2. Select the right server (hostname is the second column)
 3. Select the "reset" tab
 4. Select the "Execute an automatic hardware reset" radio button and
    hit "Send". This is equivalent to hitting the "reset" button on a
    computer.
 5. Wait a "few" (2? 5? 10? 20?) minutes for the server to return,
    depending on how hopeful you are that this simple procedure will
    work.
 6. If that fails, select the "Order a manual hardware reset" option
    and hit "Send". This will send an actual human to attend the
    server and see if they can bring it back online.

If all else fails, select the "Support" tab and open a support
request.
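
The wait after an automatic reset (step 5) can be scripted instead of
watched by hand. This is only a sketch: the function name, the
10-minute default timeout and the `PROBE` override are illustrative
choices, not an agreed policy, and `ping -W 2` assumes the Linux
(iputils) `ping`, where `-W` is a timeout in seconds.

```shell
# wait_for_host HOST [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Poll HOST until it answers or the timeout expires.
# PROBE may be set to another command (it receives HOST as its last
# argument); the default is a single ping with a 2 second timeout.
# Note: POSIX sh has no local variables, so these names are global.
wait_for_host() {
    host=$1
    timeout=${2:-600}
    interval=${3:-10}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if ${PROBE:-ping -c 1 -W 2} "$host" >/dev/null 2>&1; then
            echo "$host responded after about ${elapsed}s"
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "$host still not responding after ${timeout}s" >&2
    return 1
}
```

Running `wait_for_host server.torproject.org 1200 30` would poll every
30 seconds for up to 20 minutes; a non-zero exit status is the cue to
move on to the manual hardware reset.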

### Hetzner Cloud (virtual servers)

 1. Visit the [Hetzner Cloud console](https://console.hetzner.cloud/) (password in
    `tor-passwords/hosts-extra-info`)
 2. Select the project (usually "default")
 3. Select the affected server
 4. Open the console (the `>_` sign on the top right), and see if
    there are any error messages and/or if you can log in there (using
    the root password in `tor-passwords/hosts`)
 5. If that fails, attempt a "Power cycle" in the "Power" tab (on the
    left)
 6. If that fails, you can also try to boot a rescue system by
    selecting "Enable Rescue & Power Cycle" in the "Rescue" tab

If all else fails, create a support request. The support menu is in
the "Person" menu on the top right of the page.

### Cymru

Open a ticket by writing to <support@cymru.com>.

### Sunet

TBD

Emergency policies
==================

These still need to be defined more clearly, but we can consider that
there are three "support levels" for emergencies:

 * code red: house is on fire, go go go
 * code yellow: Houston, we have a problem, but we'll live for a day
 * routine: file a bug report, we'll get to it soon!

Code red
--------

A "code red" is a critical condition that requires immediate
action. It's what we consider an "emergency".

Code yellow
-----------

A "code yellow" is a situation where we are overwhelmed but there
isn't exactly an immediate emergency to deal with. Unlike a code red,
above, it follows a separate process, called a "[code yellow](https://devops.com/code-yellow-when-operations-isnt-perfect/)" ([SRECON19 presentation](https://www.usenix.org/conference/srecon19americas/presentation/kehoe),
[slides](https://www.usenix.org/sites/default/files/conference/protected-files/sre19amer_slides_kehoe.pdf)), which we might want to consider for fixing
longer-term issues.

Routine
-------

TBD.