title: Incident and emergency response: what to do in case of fire
This documentation is for sysadmins to figure out what to do when things go wrong. If you don't have the required accesses and haven't been trained for such situations, you might be better off waking up someone who can deal with them; see the support documentation instead.
This page lists situations that are not service-specific: generic issues that can happen on any server (or even on your home network). They are, in a sense, the default location for "pager playbooks" that would otherwise live in the service documentation.
Therefore, if the fault concerns a specific service, you will more likely find what you are looking for in the service listing.
Specific situations
Server down
If a server is reported as non-responsive, this situation can be caused by:
- a network outage at our provider
- sometimes the network outage is happening between two of our providers, so make sure to test network reachability from more than one place on the internet
- RAM and swap being full
- the host being offline or crashed
You can first check if it is actually reachable over the network:
ping -4 -c 10 server.torproject.org
ping -6 -c 10 server.torproject.org
ssh server.torproject.org
If it does respond from at least one point on the internet, you can try to diagnose the issue by looking at Prometheus and/or Grafana and analyse what, exactly, is going on. If you're lucky enough to have SSH access, you can dive deeper into the logs and systemd unit status: for example, it might just be that the node exporter has crashed.
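If you suspect the exporter itself, a quick check like the following should confirm it (a minimal sketch: the prometheus-node-exporter unit name is an assumption, adjust it to whatever is actually deployed on the host):
# check whether the node exporter unit is still running
systemctl status prometheus-node-exporter
# recent logs for the unit, in case it crashed or is flapping
journalctl -u prometheus-node-exporter -e
# restart it if it simply died
systemctl restart prometheus-node-exporter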
If the host does not respond, you should see if it's a virtual machine, and in that case, which server is hosting it. This information is available in howto/ldap (or the web interface, under the physicalHost field). Then login to that server to diagnose the issue. If the physical host is a ganeti node, you can use the serial console; if it's not a ganeti node, you can try to access the console on the hosting provider's web site.
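For guests running on a ganeti node, the serial console can be reached from the cluster master; here is a minimal sketch (the instance name is a placeholder):
# on the ganeti master of the cluster hosting the guest
gnt-instance list                             # confirm which node the instance runs on
gnt-instance console server.torproject.org   # attach to the guest's serial console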
Once you have access to the console, look out for signs of errors like OOM-Kill, disk failures, kernel panics, network-related errors. If you're still able to login and investigate, you might be able to bring the machine back online. Otherwise, look in subsections below for how to perform hard resets.
If the physical host is not responding, or the physicalHost field is empty (in which case the machine is itself a physical host), you need to file a ticket with the upstream provider. That information is available in LDAP, under the physicalHost field.
Check if the parent server is online. If you still can't figure out which server that is, use traceroute or mtr to see the hops to the server. Normally, you should see a reverse DNS matching one of our points of presence. This will also show you whether or not the upstream routers are responsive. This is an example of a healthy trace to fsn-node-01, hosted at Hetzner robot, as seen from the other cluster in Dallas:
root@ssh-dal-01:~# mtr -c 10 -w fsn-node-01.torproject.org
Start: 2024-06-19T18:40:03+0000
HOST: ssh-dal-01 Loss% Snt Last Avg Best Wrst StDev
1.|-- gw-01.gnt-dal-01.torproject.org 0.0% 10 0.5 4.2 0.4 35.2 10.9
2.|-- e0-7.switch3.dal2.he.net 90.0% 10 1.7 1.7 1.7 1.7 0.0
3.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
4.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
5.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
6.|-- port-channel9.core2.par3.he.net 0.0% 10 103.5 105.5 102.3 126.5 7.5
7.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
8.|-- hetzner-online.par.franceix.net 10.0% 10 102.0 102.0 101.9 102.2 0.1
9.|-- core12.nbg1.hetzner.com 0.0% 10 120.6 121.4 120.4 125.5 1.6
10.|-- core22.fsn1.hetzner.com 0.0% 10 122.9 123.5 122.7 126.2 1.3
11.|-- 2a01:4f8:0:3::5fe 0.0% 10 122.8 122.8 122.7 123.0 0.1
12.|-- fsn-node-01.torproject.org 0.0% 10 123.1 123.1 122.8 124.0 0.4
In the above, you can see the packets leave the continent from Dallas (hop 2) and land in Paris (hop 6); the hops in between are not responding and are therefore hidden.
Here's a healthy trace to hetzner-hel1-01, hosted in the Hetzner cloud:
root@ssh-dal-01:~# mtr -c 10 -w hetzner-hel1-01.torproject.org
Start: 2024-06-19T18:41:22+0000
HOST: ssh-dal-01 Loss% Snt Last Avg Best Wrst StDev
1.|-- gw-01.gnt-dal-01.torproject.org 0.0% 10 1.0 0.6 0.4 1.0 0.2
2.|-- e0-7.switch3.dal2.he.net 70.0% 10 1.2 1.2 1.1 1.3 0.1
3.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
4.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
5.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
6.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
7.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
8.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
9.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
10.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
11.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
12.|-- 2a03:5f80:4:2::236:87 0.0% 10 130.2 130.9 129.7 138.9 2.8
13.|-- core32.hel1.hetzner.com 0.0% 10 129.8 129.9 129.7 130.1 0.1
14.|-- spine16.cloud1.hel1.hetzner.com 0.0% 10 128.5 130.6 128.4 145.2 5.3
15.|-- spine2.cloud1.hel1.hetzner.com 0.0% 10 129.4 129.6 129.1 131.4 0.7
16.|-- ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
17.|-- 12995.your-cloud.host 0.0% 10 128.6 128.7 128.5 129.1 0.2
18.|-- hetzner-hel1-01.torproject.org 0.0% 10 130.7 131.1 130.3 135.4 1.5
What follows are per-provider instructions:
Hetzner robot (physical servers)
If you're not sure yet whether it's the server or Hetzner, you can use location-specific Hetzner targets:
ash.icmp.hetzner.com
fsn.icmp.hetzner.com
hel.icmp.hetzner.com
nbg.icmp.hetzner.com
... and so on.
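For example, comparing a ping to the location target with a ping to the affected server can tell you whether the whole location is unreachable or just the machine (a quick sketch, with fsn-node-01 standing in for the affected server):
# if the location target answers but the server does not, the problem is
# likely the machine itself rather than Hetzner's network
ping -c 10 fsn.icmp.hetzner.com
ping -c 10 fsn-node-01.torproject.org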
If all fails, you can try to reset or reboot the server remotely:
- Visit the Hetzner Robot server page (password in tor-passwords/hosts-extra-info)
- Select the right server (hostname is in the second column)
- Select the "reset" tab
- Select the "Execute an automatic hardware reset" radio button and hit "Send". This is equivalent to hitting the "reset" button on a computer.
- Wait for the server to return for a "few" (2? 5? 10? 20?) minutes, depending on how hopeful you are this simple procedure will work.
- If that fails, select the "Order a manual hardware reset" option and hit "Send". This will send an actual human to attend the server and see if they can bring it back online.
If all else fails, select the "Support" tab and open a support request.
DO NOT file a ticket with support@hetzner.com. That email address is notoriously slow to get an answer from; see incident 40432 for a delay of more than three days.
Hetzner Cloud (virtual servers)
- Visit the Hetzner Cloud console (password in tor-passwords/hosts-extra-info)
- Select the project (usually "default")
- Select the affected server
- Open the console (the >_ sign on the top right), and see if there are any error messages and/or if you can login there (using the root password in tor-passwords/hosts)
- If that fails, attempt a "Power cycle" in the "Power" tab (on the left)
- If that fails, you can also try to boot a rescue system by selecting "Enable Rescue & Power Cycle" in the "Rescue" tab
If all else fails, create a support request. The support menu is in the "Person" menu on the top right of the page.
DO NOT file a ticket with support@hetzner.com. That email address is notoriously slow to get an answer from; see incident 40432 for a delay of more than three days.
Cymru
Open a ticket by writing to support@cymru.com.
Sunet / safespring
TBD
Intermittent problems
If you have an intermittent problem that takes a while to manifest itself, you can increase the ping packet count (-c). If that takes too long, you can enable "flood" mode (-f), which decreases the interval between packets: it waits less between each failing probe, or sends as soon as a reply is received, up to a certain rate.
Here is, for example, a successful 1000 packet ping executed in 100ms:
root@tb-build-03:~# ping -f -c 1000 dal-node-01.torproject.org
PING dal-node-01.torproject.org(2620:7:6002:0:3eec:efff:fed5:6b2a) 56 data bytes
--- dal-node-01.torproject.org ping statistics ---
1000 packets transmitted, 1000 received, 0% packet loss, time 101ms
rtt min/avg/max/mdev = 0.075/0.086/0.211/0.012 ms, ipg/ewma 0.101/0.095 ms
And here is a failing ping aborted after 14 seconds:
root@tb-build-03:~# ping -f -c 1000 maven.mozilla.org
PING maven.mozilla.org(2600:9000:24f8:fa00:1b:afe8:4000:93a1) 56 data bytes
..............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................^C
--- maven.mozilla.org ping statistics ---
878 packets transmitted, 0 received, 100% packet loss, time 14031ms
(See tpo/tpa/team#41654 for a discussion and further analysis of that specific issue.)
MTR can help diagnose issues in this case. Vary parameters like IPv6 (-6) or TCP (--tcp). In the above case, the problem could be reproduced with mtr --tcp -6 -c 10 -w maven.mozilla.org.
Tools like curl can also be useful for quick diagnostics, but note that curl implements the "happy eyeballs" standard, so it might hide issues (e.g. with IPv6) that are still affecting other clients.
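To work around that, you can force a single address family so curl cannot silently fall back to the other one (a minimal sketch, reusing the host from the example above):
# test IPv6 only, then IPv4 only, to see whether one path is broken
curl -6 -sv -o /dev/null https://maven.mozilla.org/
curl -4 -sv -o /dev/null https://maven.mozilla.org/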
Unexpected reboot
If a host reboots without manual intervention, there can be several different causes. Identifying exactly what happened after the fact can be challenging, or in some cases impossible, since the logs might not have been updated with information about the issue before the reboot.
But in some cases the logs do have some information. Some things that can be investigated:
- syslog: look particularly for disk errors, OOM kill messages close to the reboot, and kernel oops messages
- dmesg from previous boots, e.g. journalctl -k -b -1; see journalctl --list-boots for a list of available boot IDs (a combined triage sketch follows this list)
- smartctl -t long and smartctl -A (or nvme [device-self-test|self-test-log]) on all devices
- /proc/mdstat and /proc/drbd: make sure that replication is still all right
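A minimal triage sketch combining those checks (device names like /dev/sda are placeholders):
journalctl --list-boots                        # which boots does journald know about?
journalctl -k -b -1 | tail -100                # kernel messages from the end of the previous boot
journalctl -b -1 | grep -iE 'out of memory|i/o error'   # OOM kills and disk errors before the reboot
cat /proc/mdstat                               # RAID replication status
cat /proc/drbd                                 # DRBD replication status (if DRBD is in use)
smartctl -t long /dev/sda                      # start a long SMART self-test (placeholder device)
smartctl -A /dev/sda                           # review SMART attributes once the test is done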
Also note that it's possible this is a spurious warning, or that a host took longer than expected to reboot. Normally, our Fabric reboot procedures issue a silence for the monitoring system to ignore those warnings. It's possible those delays are not appropriate for this host, for example, and might need to be tweaked upwards.
Network-level attacks
If you are sure that a specific $IP is mounting a Denial of Service attack on a server, you can block it with:
iptables -I INPUT -s $IP -j DROP
$IP can also be a network in CIDR notation, e.g. the following drops a whole Google /16 from the host:
iptables -I INPUT -s 74.125.0.0/16 -j DROP
Note that the above inserts (-I) a rule into the rule chain, which puts it before other rules. This is most likely what you want, as it's often possible there's an already existing rule that will allow the traffic through, making a rule appended (-A) to the chain ineffective.
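To review and later remove such an emergency rule, something like this should work (a sketch, reusing the address from the example above):
# list the INPUT chain with rule numbers and hit counters
iptables -L INPUT -n -v --line-numbers
# remove the rule once the attack has subsided (same specification as when it was added)
iptables -D INPUT -s 74.125.0.0/16 -j DROP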
See also our nftables documentation.
Filesystem set to readonly
If a filesystem is switched to readonly, no process can write to the affected volume anymore, which can have consequences of varying severity depending on which volume is readonly.
If Linux automatically changes a filesystem to readonly, it usually indicates that some serious issues were detected with the disk or filesystem. Those can be:
- physical drive errors
- bad sectors or other detected ongoing data corruption
- hard drive driver errors
- filesystem corruption
Look out for disk- or filesystem-related errors in:
- syslog
- dmesg
- physical console (e.g. IPMI console)
In some cases with ext4, running fsck can fix issues. However, watch out for files disappearing or being moved to lost+found if the filesystem encounters serious enough inconsistencies.
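A minimal sketch of such a repair, assuming the affected filesystem is a non-root volume (the device and mount point names are placeholders):
# unmount the affected filesystem (for the root filesystem, boot a rescue system instead)
umount /srv
# check and repair, reviewing each proposed fix before accepting it
fsck.ext4 -f /dev/vg0/srv
# remount and look for anything that ended up in lost+found
mount /srv
ls /srv/lost+found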
If a hard disk is showing signs of breakage, it will usually get ejected from the RAID array without blocking the filesystem. However, if the disk breakage did impact filesystem consistency and caused it to switch to readonly, migrate the data away from that drive ASAP, for example by moving the instance to its secondary node or by rsync'ing it to another machine.
In such a case, you'll also want to review what other instances are currently using the same drive and possibly move all of those instances as well before replacing the drive.
Web server down
Apache web server diagnostics
If you get an alert like ApacheDown, that is:
Apache web server down on test.example.com
It means the Apache exporter cannot contact the local web server over its control address http://localhost/server-status/?auto. First, confirm whether this is a problem with the exporter or the entire service, by checking the main service on this host to see if users are affected. If they are, prioritize that.
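A quick way to check the user-facing side is to hit the public site directly (a sketch, using the placeholder hostname from the alert above):
# an HTTP error or a timeout here means users are affected, not just the exporter
curl -sSI https://test.example.com/ | head -1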
It's possible, for example, that the web server has crashed for some reason. The best way to figure that out is to check the service status with:
service apache2 status
You should see something like this if the server is running correctly:
● apache2.service - The Apache HTTP Server
Loaded: loaded (/lib/systemd/system/apache2.service; enabled; preset: enabled)
Active: active (running) since Tue 2024-09-10 14:56:49 UTC; 1 day 5h ago
Docs: https://httpd.apache.org/docs/2.4/
Process: 475367 ExecReload=/usr/sbin/apachectl graceful (code=exited, status=0/SUCCESS)
Main PID: 338774 (apache2)
Tasks: 53 (limit: 4653)
Memory: 28.6M
CPU: 11min 30.297s
CGroup: /system.slice/apache2.service
├─338774 /usr/sbin/apache2 -k start
└─475411 /usr/sbin/apache2 -k start
Sep 10 17:51:50 donate-01 systemd[1]: Reloading apache2.service - The Apache HTTP Server...
Sep 10 17:51:50 donate-01 systemd[1]: Reloaded apache2.service - The Apache HTTP Server.
Sep 10 19:53:00 donate-01 systemd[1]: Reloading apache2.service - The Apache HTTP Server...
Sep 10 19:53:00 donate-01 systemd[1]: Reloaded apache2.service - The Apache HTTP Server.
Sep 11 00:00:01 donate-01 systemd[1]: Reloading apache2.service - The Apache HTTP Server...
Sep 11 00:00:01 donate-01 systemd[1]: Reloaded apache2.service - The Apache HTTP Server.
Sep 11 01:29:29 donate-01 systemd[1]: Reloading apache2.service - The Apache HTTP Server...
Sep 11 01:29:29 donate-01 systemd[1]: Reloaded apache2.service - The Apache HTTP Server.
Sep 11 19:50:51 donate-01 systemd[1]: Reloading apache2.service - The Apache HTTP Server...
Sep 11 19:50:51 donate-01 systemd[1]: Reloaded apache2.service - The Apache HTTP Server.
With the first dot (●) in green and the Active line saying active (running). If it isn't, the logs should show why it failed to start.
It's possible you won't see the right logs in there if the service is stuck in a restart loop. In that case, use this command instead to see the service logs:
journalctl -b -u apache2
That shows the logs for the service since the last boot.
If the main service is online and it's only the exporter having trouble, try to reproduce the issue with curl from the affected server, for example:
root@test.example.com:~# curl http://localhost/server-status/?auto
Normally, this should work, but it's possible Apache is misconfigured and doesn't listen on localhost for some reason. Look at the apache2ctl -S output, and the rest of the Apache configuration in /etc/apache2, particularly the Listen directives (normally in ports.conf).
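To see what Apache is actually bound to, a couple of quick checks (a sketch, with nothing host-specific assumed):
# virtual host and listener summary as parsed by Apache
apache2ctl -S
# which addresses and ports the running processes are actually listening on
ss -tlnp | grep apache2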
See also the Apache exporter scraping failed instructions in the Prometheus documentation, which cover a related alert.
Disk is full or nearly full
When a disk is filled to 100% of its capacity, some processes can have trouble continuing to work normally. For example, PostgreSQL will purposefully exit when that happens, to avoid the risk of data corruption. MySQL is not so graceful, and can end up with data corruption in some of its databases.
The first step is to check how long you have. For this, a good tool is the Grafana disk usage dashboard: select the affected instance and look at the "change rate" panel, which should show you how much time is left per partition.
To clear up this situation, there are two approaches that can be used in succession:
- find what's using disk space and clear out some files
- grow the disk
The first thing to attempt is to identify where disk space is being used and remove some big files. For example, if the root partition is full, this will show you what is taking up space:
ncdu -x /
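If ncdu is not installed and cannot be installed (for example because the disk is completely full), plain du gives a similar overview (a sketch):
# largest directories on the root filesystem, without crossing filesystem boundaries
du -xh --max-depth=2 / 2>/dev/null | sort -h | tail -20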
Examples
Maybe the syslog grew to ridiculous sizes? Try:
logrotate -f /etc/logrotate.d/syslog-ng
Maybe some users have huge DB dumps lying around in their home directory. After confirming that those files can be deleted:
rm /home/flagada/huge_dump.sql
Maybe the systemd journal has grown too big. This will keep only 500MB:
journalctl --vacuum-size=500M
If in the cleanup phase you can't identify files that can be removed, you'll need to grow the disk. See how to grow disks with ganeti.
Note that it's possible a suddenly growing disk might be a symptom of a larger problem, for example bots crawling a website abusively or an attacker running a denial of service attack. This warrants further (and more complex) investigation, of course, but can be delegated to after the disk usage alert has been handled.
Host clock desynchronized
If a host's clock has drifted and is no longer in sync with the rest of the internet, some really strange things can start happening, like TLS connections failing even though the certificate is still valid.
If a host has time synchronization issues, check that the ntpd service is still running:
systemctl status ntpd.service
You can gather information about which peer servers are drifting:
ntpq -pun
Logs for this service are sent to syslog, so you can take a look there to see if some errors were mentioned.
If restarting the ntpd service does not work, verify that a firewall is not blocking UDP port 123.
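A short sketch of those checks (assuming the standard ntpd/ntpq tooling referenced above; the nft command is an assumption about the host's firewall):
# does systemd consider the clock synchronized?
timedatectl status
# peer offsets and reachability: large offsets or unreachable peers indicate drift
ntpq -pn
# restart the daemon and check again
systemctl restart ntpd.service
# if peers stay unreachable, make sure UDP port 123 is not filtered
nft list ruleset | grep -w 123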
Support policies
Please see TPA-RFC-2: support.