title: Incident and emergency response: what to do in case of fire
This documentation is for sysadmins to figure out what to do when things go wrong. If you don't have the required accesses and haven't been trained for such situations, you might be better off waking up someone who can deal with them; see the support documentation instead.
This page lists situations that are not service-specific: generic issues that can happen on any server (or even on your home network). They are, in a sense, the default location for "pager playbooks" that would otherwise live in the service documentation.
Therefore, if the fault concerns a specific service, you will more likely find what you are looking for in the service listing.
Specific situations
Server down
If a server is reported as non-responsive, this situation can be caused by:
- a network outage at our provider
- sometimes the network outage is happening between two of our providers, so make sure to test network reachability from more than one place on the internet
- RAM and swap being full
- the host being offline or crashed
You can first check if it is actually reachable over the network:
ping -4 -c 10 server.torproject.org
ping -6 -c 10 server.torproject.org
ssh server.torproject.org
If it does respond from at least one point on the internet, you can try to diagnose the issue by looking at Prometheus and/or Grafana and analyse what, exactly, is going on. If you're lucky enough to have SSH access, you can dive deeper into the logs and systemd unit status: for example, it might just be that the node exporter has crashed.
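If you suspect the exporter itself, a quick check like the following should confirm it (a minimal sketch: the prometheus-node-exporter unit name is an assumption, adjust it to whatever is actually deployed on the host):
# check whether the node exporter unit is still running
systemctl status prometheus-node-exporter
# recent logs for the unit, in case it crashed or is flapping
journalctl -u prometheus-node-exporter -e
# restart it if it simply died
systemctl restart prometheus-node-exporter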
If the host does not respond, you should see if it's a virtual machine, and in that case, which server is hosting it. This information is available in howto/ldap (or the web interface, under the physicalHost field). Then login to that server to diagnose the issue. If the physical host is a ganeti node, you can use the serial console; if it's not a ganeti node, you can try to access the console on the hosting provider's web site.
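For guests running on a ganeti node, the serial console can be reached from the cluster master; here is a minimal sketch (the instance name is a placeholder):
# on the ganeti master of the cluster hosting the guest
gnt-instance list                             # confirm which node the instance runs on
gnt-instance console server.torproject.org   # attach to the guest's serial console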
Once you have access to the console, look out for signs of errors like OOM-Kill, disk failures, kernel panics, network-related errors. If you're still able to login and investigate, you might be able to bring the machine back online. Otherwise, look in subsections below for how to perform hard resets.
If the physical host is not responding, or the physicalHost field is empty (in which case the machine is itself a physical host), you need to file a ticket with the upstream provider. That information is available in LDAP, under the physicalHost field.
Check if the parent server is online. If you still can't figure out which server that is, use traceroute or mtr to see the hops to the server. Normally, you should see a reverse DNS matching one of our points of presence. This will also show you whether or not the upstream routers are responsive. This is an example of a healthy trace to fsn-node-01, hosted at Hetzner robot, as seen from the other cluster in Dallas:
root@ssh-dal-01:~# mtr -c 10 -w fsn-node-01.torproject.org
Start: 2024-06-19T18:40:03+0000
HOST: ssh-dal-01 Loss% Snt Last Avg Best Wrst StDev
1.|-- gw-01.gnt-dal-01.torproject.org 0.0% 10 0.5 4.2 0.4 35.2 10.9
2.|-- e0-7.switch3.dal2.he.net 90.0% 10 1.7 1.7 1.7 1.7 0.0
3.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
4.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
5.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
6.|-- port-channel9.core2.par3.he.net 0.0% 10 103.5 105.5 102.3 126.5 7.5
7.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
8.|-- hetzner-online.par.franceix.net 10.0% 10 102.0 102.0 101.9 102.2 0.1
9.|-- core12.nbg1.hetzner.com 0.0% 10 120.6 121.4 120.4 125.5 1.6
10.|-- core22.fsn1.hetzner.com 0.0% 10 122.9 123.5 122.7 126.2 1.3
11.|-- 2a01:4f8:0:3::5fe 0.0% 10 122.8 122.8 122.7 123.0 0.1
12.|-- fsn-node-01.torproject.org 0.0% 10 123.1 123.1 122.8 124.0 0.4
In the above, you can see the packets leave the continent from Dallas (hop 2) and land in Paris (hop 6); the hops in between are not responding and are therefore hidden.
Here's a healthy trace to hetzner-hel1-01, hosted in the Hetzner cloud:
root@ssh-dal-01:~# mtr -c 10 -w hetzner-hel1-01.torproject.org
Start: 2024-06-19T18:41:22+0000
HOST: ssh-dal-01 Loss% Snt Last Avg Best Wrst StDev
1.|-- gw-01.gnt-dal-01.torproject.org 0.0% 10 1.0 0.6 0.4 1.0 0.2
2.|-- e0-7.switch3.dal2.he.net 70.0% 10 1.2 1.2 1.1 1.3 0.1
3.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
4.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
5.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
6.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
7.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
8.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
9.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
10.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
11.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
12.|-- 2a03:5f80:4:2::236:87 0.0% 10 130.2 130.9 129.7 138.9 2.8
13.|-- core32.hel1.hetzner.com 0.0% 10 129.8 129.9 129.7 130.1 0.1
14.|-- spine16.cloud1.hel1.hetzner.com 0.0% 10 128.5 130.6 128.4 145.2 5.3
15.|-- spine2.cloud1.hel1.hetzner.com 0.0% 10 129.4 129.6 129.1 131.4 0.7
16.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
17.|-- 12995.your-cloud.host 0.0% 10 128.6 128.7 128.5 129.1 0.2
18.|-- hetzner-hel1-01.torproject.org 0.0% 10 130.7 131.1 130.3 135.4 1.5
What follows are per-provider instructions:
Hetzner robot (physical servers)
If you're not sure yet whether it's the server or Hetzner, you can use location-specific Hetzner targets:
ash.icmp.hetzner.com
fsn.icmp.hetzner.com
hel.icmp.hetzner.com
nbg.icmp.hetzner.com
... and so on.
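For example, comparing a ping to the location target with a ping to the affected server can tell you whether the whole location is unreachable or just the machine (a quick sketch, with fsn-node-01 standing in for the affected server):
# if the location target answers but the server does not, the problem is
# likely the machine itself rather than Hetzner's network
ping -c 10 fsn.icmp.hetzner.com
ping -c 10 fsn-node-01.torproject.org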
If all fails, you can try to reset or reboot the server remotely:
- Visit the Hetzner Robot server page (password in tor-passwords/hosts-extra-info)
- Select the right server (hostname is in the second column)
- Select the "reset" tab
- Select the "Execute an automatic hardware reset" radio button and hit "Send". This is equivalent to hitting the "reset" button on a computer.
- Wait for the server to return for a "few" (2? 5? 10? 20?) minutes, depending on how hopeful you are this simple procedure will work.
- If that fails, select the "Order a manual hardware reset" option and hit "Send". This will send an actual human to attend the server and see if they can bring it back online.
If all else fails, select the "Support" tab and open a support request.
DO NOT file a ticket with support@hetzner.com. That email address is notoriously slow to get an answer from; see incident 40432 for a delay of more than three days.
Hetzner Cloud (virtual servers)
- Visit the Hetzner Cloud console (password in tor-passwords/hosts-extra-info)
- Select the project (usually "default")
- Select the affected server
- Open the console (the >_ sign on the top right), and see if there are any error messages and/or if you can login there (using the root password in tor-passwords/hosts)
- If that fails, attempt a "Power cycle" in the "Power" tab (on the left)
- If that fails, you can also try to boot a rescue system by selecting "Enable Rescue & Power Cycle" in the "Rescue" tab
If all else fails, create a support request. The support menu is in the "Person" menu on the top right of the page.
DO NOT file a ticket with support@hetzner.com. That email address is notoriously slow to get an answer from; see incident 40432 for a delay of more than three days.
Cymru
Open a ticket by writing to support@cymru.com.
Sunet / safespring
TBD
Intermittent problems
If you have an intermittent problem that takes a while to manifest itself, you can increase the ping packet count (-c). If that takes too long, you can enable "flood" mode (-f), which decreases the interval between packets: it waits less between each failing probe, or sends as soon as a reply is received, up to a certain rate.
Here is, for example, a successful 1000 packet ping executed in 100ms:
root@tb-build-03:~# ping -f -c 1000 dal-node-01.torproject.org
PING dal-node-01.torproject.org(2620:7:6002:0:3eec:efff:fed5:6b2a) 56 data bytes
--- dal-node-01.torproject.org ping statistics ---
1000 packets transmitted, 1000 received, 0% packet loss, time 101ms
rtt min/avg/max/mdev = 0.075/0.086/0.211/0.012 ms, ipg/ewma 0.101/0.095 ms
And here is a failing ping aborted after 14 seconds:
root@tb-build-03:~# ping -f -c 1000 maven.mozilla.org
PING maven.mozilla.org(2600:9000:24f8:fa00:1b:afe8:4000:93a1) 56 data bytes
..............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................^C
--- maven.mozilla.org ping statistics ---
878 packets transmitted, 0 received, 100% packet loss, time 14031ms
(See tpo/tpa/team#41654 for a discussion and further analysis of that specific issue.)
MTR can help diagnose issues in this case. Vary parameters like IPv6 (-6) or TCP (--tcp). In the above case, the problem could be reproduced with mtr --tcp -6 -c 10 -w maven.mozilla.org.
Tools like curl can also be useful for quick diagnostics, but note that curl implements the "happy eyeballs" standard, so it might hide issues (e.g. with IPv6) that are still affecting other clients.
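To work around that, you can force a single address family so curl cannot silently fall back to the other one (a minimal sketch, reusing the host from the example above):
# test IPv6 only, then IPv4 only, to see whether one path is broken
curl -6 -sv -o /dev/null https://maven.mozilla.org/
curl -4 -sv -o /dev/null https://maven.mozilla.org/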
Unexpected reboot
If a host reboots without manual intervention, there can be several different causes. Identifying exactly what happened after the fact can be challenging, or in some cases impossible, since the logs might not have been updated with information about the issue before the reboot.
But in some cases the logs do have some information. Some things that can be investigated:
- syslog: look particularly for disk errors, OOM kill messages close to the reboot, and kernel oops messages
- dmesg from previous boots, e.g. journalctl -k -b -1; see journalctl --list-boots for a list of available boot IDs (a combined triage sketch follows this list)
- smartctl -t long and smartctl -A (or nvme [device-self-test|self-test-log]) on all devices
- /proc/mdstat and /proc/drbd: make sure that replication is still all right
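A minimal triage sketch combining those checks (device names like /dev/sda are placeholders):
journalctl --list-boots                        # which boots does journald know about?
journalctl -k -b -1 | tail -100                # kernel messages from the end of the previous boot
journalctl -b -1 | grep -iE 'out of memory|i/o error'   # OOM kills and disk errors before the reboot
cat /proc/mdstat                               # RAID replication status
cat /proc/drbd                                 # DRBD replication status (if DRBD is in use)
smartctl -t long /dev/sda                      # start a long SMART self-test (placeholder device)
smartctl -A /dev/sda                           # review SMART attributes once the test is done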
Also note that it's possible this is a spurious warning, or that a host took longer than expected to reboot. Normally, our Fabric reboot procedures issue a silence for the monitoring system to ignore those warnings. It's possible those delays are not appropriate for this host, for example, and might need to be tweaked upwards.
Network-level attacks
If you are sure that a specific $IP is mounting a Denial of Service attack on a server, you can block it with:
iptables -I INPUT -s $IP -j DROP
$IP can also be a network in CIDR notation, e.g. the following drops a whole Google /16 from the host:
iptables -I INPUT -s 74.125.0.0/16 -j DROP
Note that the above inserts (-I) a rule into the rule chain, which puts it before other rules. This is most likely what you want, as it's often possible there's an already existing rule that will allow the traffic through, making a rule appended (-A) to the chain ineffective.
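To review and later remove such an emergency rule, something like this should work (a sketch, reusing the address from the example above):
# list the INPUT chain with rule numbers and hit counters
iptables -L INPUT -n -v --line-numbers
# remove the rule once the attack has subsided (same specification as when it was added)
iptables -D INPUT -s 74.125.0.0/16 -j DROP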
See also our nftables documentation.
Filesystem set to readonly
If a filesystem is switched to readonly, no process can write to the affected volume anymore, which can have consequences of varying severity depending on which volume is readonly.
If Linux automatically changes a filesystem to readonly, it usually indicates that some serious issues were detected with the disk or filesystem. Those can be:
- physical drive errors
- bad sectors or other detected ongoing data corruption
- hard drive driver errors
- filesystem corruption
Look out for disk- or filesystem-related errors in:
- syslog
- dmesg
- physical console (e.g. IPMI console)
In some cases with ext4, running fsck can fix issues. However, watch out for files disappearing or being moved to lost+found if the filesystem encounters serious enough inconsistencies.
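A minimal sketch of such a repair, assuming the affected filesystem is a non-root volume (the device and mount point names are placeholders):
# unmount the affected filesystem (for the root filesystem, boot a rescue system instead)
umount /srv
# check and repair, reviewing each proposed fix before accepting it
fsck.ext4 -f /dev/vg0/srv
# remount and look for anything that ended up in lost+found
mount /srv
ls /srv/lost+found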
If a hard disk is showing signs of breakage, it will usually get ejected from the RAID array without blocking the filesystem. However, if the disk breakage did impact filesystem consistency and caused it to switch to readonly, migrate the data away from that drive ASAP, for example by moving the instance to its secondary node or by rsync'ing it to another machine.
In such a case, you'll also want to review what other instances are currently using the same drive and possibly move all of those instances as well before replacing the drive.
Web server down
Apache web server diagnostics
If you get an alert like ApacheDown, that is:
Apache web server down on test.example.com
It means the Apache exporter cannot contact the local web server over its control address http://localhost/server-status/?auto. First, confirm whether this is a problem with the exporter or the entire service, by checking the main service on this host to see if users are affected. If they are, prioritize that.
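A quick way to check the user-facing side is to hit the public site directly (a sketch, using the placeholder hostname from the alert above):
# an HTTP error or a timeout here means users are affected, not just the exporter
curl -sSI https://test.example.com/ | head -1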
It's possible, for example, that the web server has crashed for some reason. The best way to figure that out is to check the service status with:
service apache2 status
You should see something like this if the server is running correctly:
● apache2.service - The Apache HTTP Server
Loaded: loaded (/lib/systemd/system/apache2.service; enabled; preset: enabled)
Active: active (running) since Tue 2024-09-10 14:56:49 UTC; 1 day 5h ago
Docs: https://httpd.apache.org/docs/2.4/
Process: 475367 ExecReload=/usr/sbin/apachectl graceful (code=exited, status=0/SUCCESS)
Main PID: 338774 (apache2)
Tasks: 53 (limit: 4653)
Memory: 28.6M
CPU: 11min 30.297s
CGroup: /system.slice/apache2.service
├─338774 /usr/sbin/apache2 -k start
└─475411 /usr/sbin/apache2 -k start
Sep 10 17:51:50 donate-01 systemd[1]: Reloading apache2.service - The Apache HTTP Server...
Sep 10 17:51:50 donate-01 systemd[1]: Reloaded apache2.service - The Apache HTTP Server.
Sep 10 19:53:00 donate-01 systemd[1]: Reloading apache2.service - The Apache HTTP Server...
Sep 10 19:53:00 donate-01 systemd[1]: Reloaded apache2.service - The Apache HTTP Server.
Sep 11 00:00:01 donate-01 systemd[1]: Reloading apache2.service - The Apache HTTP Server...
Sep 11 00:00:01 donate-01 systemd[1]: Reloaded apache2.service - The Apache HTTP Server.
Sep 11 01:29:29 donate-01 systemd[1]: Reloading apache2.service - The Apache HTTP Server...
Sep 11 01:29:29 donate-01 systemd[1]: Reloaded apache2.service - The Apache HTTP Server.
Sep 11 19:50:51 donate-01 systemd[1]: Reloading apache2.service - The Apache HTTP Server...
Sep 11 19:50:51 donate-01 systemd[1]: Reloaded apache2.service - The Apache HTTP Server.
With the first dot (●) in green and the Active line saying active (running). If it isn't, the logs should show why it failed to start.
It's possible you won't see the right logs in there if the service is stuck in a restart loop. In that case, use this command instead to see the service logs:
journalctl -b -u apache2
That shows the logs for the service since the last boot.
If the main service is online and it's only the exporter having trouble, try to reproduce the issue with curl from the affected server, for example:
root@test.example.com:~# curl http://localhost/server-status/?auto
Normally, this should work, but it's possible Apache is misconfigured and doesn't listen on localhost for some reason. Look at the apache2ctl -S output, and the rest of the Apache configuration in /etc/apache2, particularly the Listen directives (normally in ports.conf).
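To see what Apache is actually bound to, a couple of quick checks (a sketch, with nothing host-specific assumed):
# virtual host and listener summary as parsed by Apache
apache2ctl -S
# which addresses and ports the running processes are actually listening on
ss -tlnp | grep apache2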
See also the Apache exporter scraping failed instructions in the Prometheus documentation, which cover a related alert.
Disk is full or nearly full
When a disk is filled to 100% of its capacity, some processes can have trouble continuing to work normally. For example, PostgreSQL will purposefully exit when that happens, to avoid the risk of data corruption. MySQL is not so graceful, and can end up with data corruption in some of its databases.
The first step is to check how long you have. For this, a good tool is the Grafana disk usage dashboard: select the affected instance and look at the "change rate" panel, which should show you how much time is left per partition.
To clear up this situation, there are two approaches that can be used in succession:
- find what's using disk space and clear out some files
- grow the disk
The first thing to attempt is to identify where disk space is being used and remove some big files. For example, if the root partition is full, this will show you what is taking up space:
ncdu -x /
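If ncdu is not installed and cannot be installed (for example because the disk is completely full), plain du gives a similar overview (a sketch):
# largest directories on the root filesystem, without crossing filesystem boundaries
du -xh --max-depth=2 / 2>/dev/null | sort -h | tail -20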
Examples
Maybe the syslog grew to ridiculous sizes? Try:
logrotate -f /etc/logrotate.d/syslog-ng
Maybe some users have huge DB dumps lying around in their home directory. After confirming that those files can be deleted:
rm /home/flagada/huge_dump.sql
Maybe the systemd journal has grown too big. This will keep only 500MB:
journalctl --vacuum-size=500M
If in the cleanup phase you can't identify files that can be removed, you'll need to grow the disk. See how to grow disks with ganeti.
Note that it's possible a suddenly growing disk might be a symptom of a larger problem, for example bots crawling a website abusively or an attacker running a denial of service attack. This warrants further (and more complex) investigation, of course, but can be delegated to after the disk usage alert has been handled.
Host clock desynchronized
If a host's clock has drifted and is no longer in sync with the rest of the internet, some really strange things can start happening, like TLS connections failing even though the certificate is still valid.
If a host has time synchronization issues, check that the ntpd service is still running:
systemctl status ntpd.service
You can gather information about which peer servers are drifting:
ntpq -pun
Logs for this service are sent to syslog, so you can take a look there to see if some errors were mentioned.
If restarting the ntpd service does not work, verify that a firewall is not blocking UDP port 123.
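A short sketch of those checks (assuming the standard ntpd/ntpq tooling referenced above; the nft command is an assumption about the host's firewall):
# does systemd consider the clock synchronized?
timedatectl status
# peer offsets and reachability: large offsets or unreachable peers indicate drift
ntpq -pn
# restart the daemon and check again
systemctl restart ntpd.service
# if peers stay unreachable, make sure UDP port 123 is not filtered
nft list ruleset | grep -w 123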
Support policies
Please see TPA-RFC-2: support.