routing issue on nagios server
Post-mortem
Timeline
- 2021-09-30 ~19:00 UTC: Nagios notices something is wrong with the FSN cluster
- 2021-10-01 ~18:00 UTC: anarcat notices the outage in Nagios, files this ticket
- 2021-10-01 18:19UTC: ticket 2021100103024963 filed at Hetzner
- 2021-10-02 13:36UTC: fsn-node-06 rebooted, no effect
- 2021-10-02 14:28UTC: bacula-director-01 migrated to fsn-node-03, backup outage partially resolved (for unaffected machines)
- 2021-10-02 15:04UTC: rouyi migrated to fsn-node-01, confirming the outage is node-specific
- 2021-10-03 00:04UTC: manual VLAN configuration failure
- 2021-10-04 12:47UTC: Hetzner replies, asks to file another ticket
- 2021-10-05 14:32UTC: ticket 2021100403021166 filed with Hetzner, call with Helsinki datacenter, no progress
- 2021-10-05 14:42UTC: Hetzner network team replies, asking for more information
- 2021-10-05 14:54UTC: another reply from Hetzner network, blaming firewall rules
- 2021-10-05 15:16UTC: network team found problem with vswitch 5391, suggests a reboot
- 2021-10-05 15:33UTC: Nagios can reach alberti
- 2021-10-05 17:10UTC: static mirrors resync'd
- 2021-10-05 17:42UTC: post-mortem written, incident closed
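For reference, the elapsed times implied by the timeline can be worked out as follows. This is only a sketch: the first alert is given as "~19:00 UTC", so the results are approximate, and the variable names and the `utc` helper are made up for the example.

```python
from datetime import datetime, timezone


def utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' timestamp from the timeline as UTC."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


# Timestamps copied from the timeline; the first one is approximate ("~19:00").
first_alert = utc("2021-09-30 19:00")   # Nagios notices something is wrong
ticket_filed = utc("2021-10-01 18:19")  # first ticket filed at Hetzner
recovered = utc("2021-10-05 15:33")     # Nagios can reach alberti again

time_to_ticket = ticket_filed - first_alert
time_to_recover = recovered - first_alert

print("first alert to first ticket:", time_to_ticket)
print("first alert to recovery:", time_to_recover)
```

The first figure, a bit over 23 hours, matches the "about 23 hours" mentioned in the first ticket.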
Root cause analysis
A failure in Hetzner's "vswitch" service caused heterogeneous routing problems between Scandinavia (Helsinki, Finland and Stockholm, Sweden at least) and our main point of presence (PoP) at Falkenstein (FSN1-DC13, specifically).
A ticket was filed with support@hetzner.com on Friday, but should have been filed through the "Robot" interface instead.
What went right
- most of the infrastructure kept working correctly, and the outage was limited to a fairly small number of machines and locations
Lessons learned
- we need to file tickets with Hetzner robot, not support@hetzner
- Hetzner's vswitch can fail in mysterious ways: if packets do not reach our interface, they need to "restart the vswitch router"
Followup work
- #40434 (closed): document that support@hetzner is crap
Overview
This is an overview of the outage, as of 2021-10-02 15:00UTC (by @anarcat):
- 32 instances in the gnt-fsn cluster are affected, on all nodes but fsn-node-03 and fsn-node-05, bizarrely
- the phenomenon happens only when connected from ipnett.se (Sweden) or Hetzner's Helsinki datacenters, and only when connecting to the above instances (and specifically only when they are on the affected nodes: if they are migrated to a "healthy" node, the problem eventually goes away)
- backups are down on the instances, because bungei cannot connect to the bacula file servers on the instances
- monitoring is down, i.e. nagios cannot reach the instances
- no IRC notifications: nagios cannot reach the "nsa" bot either, and I actually can't figure out how that even works in the first place
- no email notifications: nagios cannot send email notifications because eugeni is down (might be worth letting nagios send its own mail...)
- since Nagios cannot reach the puppet server either, it looks like NRPE is down on many more hosts, but it's not: it's just pauli's NRPE that's unreachable
- some instances (to be clarified) cannot run puppet, as pauli is one of the affected nodes
- the mirror network may be affected: nagios is warning about "CRITICAL: 116.202.120.165 broken: 500 Can't connect to www.torproject.org:443 (Connection timed out), 1 mirror(s) not in sync (from oldest to newest): 116.202.120.165", but this could be just nagios failing to run its check
- we suspect a problem with openvswitch, but cannot trace a cause on our side
- we followed the Hetzner documentation to configure a vswitch on fsn-node-07 and there are still routing problems, therefore the problem is not on our end
So far, we haven't received any complaints from users about problems with the infrastructure, so this could just be a fluke inside Hetzner or limited to Sweden/Finland.
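The pattern in this overview (a target that answers from some vantage points but not from others) is what separates a routing problem from a host that is simply down. Here is a minimal sketch of that reasoning, assuming ping results have already been collected by hand from each source. The `classify` function and its verdict labels are made up for this example, and the observations are the ones reported in the ticket below.

```python
from typing import Dict, List, Tuple


def classify(results: Dict[Tuple[str, str], bool]) -> Dict[str, str]:
    """Group (source, target) ping results by target and label each target.

    'ok'            reachable from every source that tried
    'down'          unreachable from every source that tried
    'path-specific' reachable from some sources only (suggests routing)
    """
    by_target: Dict[str, List[bool]] = {}
    for (_source, target), reachable in results.items():
        by_target.setdefault(target, []).append(reachable)

    verdicts: Dict[str, str] = {}
    for target, outcomes in by_target.items():
        if all(outcomes):
            verdicts[target] = "ok"
        elif not any(outcomes):
            verdicts[target] = "down"
        else:
            verdicts[target] = "path-specific"
    return verdicts


# (source, target) -> did the ping succeed? Taken from the ticket transcripts.
observations = {
    ("hetzner-hel1-01", "alberti"): False,
    ("angela", "alberti"): True,
    ("fsn-node-01", "alberti"): True,
    ("hetzner-hel1-01", "fsn-node-01"): True,
}

print(classify(observations))
```

A target labelled "ok" here was only probed from the sources listed, so the verdict is only as good as the set of vantage points.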
First ticket at hetzner
I filed this ticket with Hetzner:
Hi,
For about 23 hours now, our monitoring server has been seeing routing
issues to other machines in the Hetzner network. Affected machines
include:
* alberti.torproject.org
* eugeni.torproject.org
* web-fsn-01.torproject.org
You will notice those machines actually respond to pings from
elsewhere. For example, I can ping them from my home in Canada (IP
206.248.172.91):
anarcat@angela:~(main)$ ping -4 -c 3 alberti.torproject.org
PING (49.12.57.132) 56(84) bytes of data.
64 bytes from alberti.torproject.org (49.12.57.132): icmp_seq=1 ttl=48 time=104 ms
64 bytes from alberti.torproject.org (49.12.57.132): icmp_seq=2 ttl=48 time=106 ms
64 bytes from alberti.torproject.org (49.12.57.132): icmp_seq=3 ttl=48 time=107 ms
--- ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 104.379/105.589/106.818/0.995 ms
"ping -c 3 alberti.torproject.org" took 5 mins
... but from our monitoring server in the Hetzner cloud, that machine is
unreachable:
root@hetzner-hel1-01:~# ping -c 3 alberti.torproject.org
PING alberti.torproject.org (49.12.57.132) 56(84) bytes of data.
--- alberti.torproject.org ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 17ms
The machine affected is hetzner-hel1-01.torproject.org. A traceroute
stops at a Juniper router inside your network, which makes me feel this
is routing issue on Hetzner's end:
root@hetzner-hel1-01:~# traceroute alberti.torproject.org
traceroute to alberti.torproject.org (49.12.57.132), 30 hops max, 60 byte packets
1 172.31.1.1 (172.31.1.1) 11.665 ms 11.821 ms 11.746 ms
2 15271.your-cloud.host (95.216.132.232) 0.396 ms 0.398 ms 0.751 ms
3 * * *
4 static.88.198.252.117.clients.your-server.de (88.198.252.117) 1.508 ms 1.668 ms static.88.198.252.113.clients.your-server.de (88.198.252.113) 1.554 ms
5 static.88-198-245-253.clients.your-server.de (88.198.245.253) 1.127 ms core32.hel1.hetzner.com (88.198.249.93) 0.582 ms core31.hel1.hetzner.com (88.198.249.89) 24.830 ms
6 juniper1.dc11.fsn1.hetzner.com (213.239.245.166) 15.955 ms 15.781 ms 15.776 ms
7 * * *
8 * * *
9 * * *
10 * * *
11 * * *
12 * * *
13 * * *
14 * * *
15 * * *
16 * * *
17 * * *
18 * * *
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * * *
25 * * *
26 * * *
27 * * *
28 * * *
29 * * *
30 * * *
Interestingly, many of the machines affected are in our Ganeti cluster
hosted in Falkenstein, under the 7 machines named
fsn-node-01.torproject.org through fsn-node-07. Those machines,
surprisingly, can ping fine from the monitoring server:
root@hetzner-hel1-01:~# ping -c 3 fsn-node-01.torproject.org
PING fsn-node-01.torproject.org (88.198.8.185) 56(84) bytes of data.
64 bytes from fsn-node-01.torproject.org (88.198.8.185): icmp_seq=1 ttl=56 time=25.2 ms
64 bytes from fsn-node-01.torproject.org (88.198.8.185): icmp_seq=2 ttl=56 time=25.3 ms
64 bytes from fsn-node-01.torproject.org (88.198.8.185): icmp_seq=3 ttl=56 time=25.1 ms
--- fsn-node-01.torproject.org ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 5ms
rtt min/avg/max/mdev = 25.092/25.216/25.328/0.207 ms
And the route takes a slightly different path:
root@hetzner-hel1-01:~# traceroute fsn-node-01.torproject.org
traceroute to fsn-node-01.torproject.org (88.198.8.185), 30 hops max, 60 byte packets
1 172.31.1.1 (172.31.1.1) 4.640 ms 4.402 ms 4.544 ms
2 15271.your-cloud.host (95.216.132.232) 0.487 ms 0.598 ms 0.560 ms
3 * * *
4 static.88.198.252.117.clients.your-server.de (88.198.252.117) 0.988 ms 0.913 ms static.88.198.252.113.clients.your-server.de (88.198.252.113) 0.916 ms
5 core31.hel1.hetzner.com (88.198.249.89) 0.762 ms 0.681 ms 0.621 ms
6 core8.fra.hetzner.com (213.239.224.149) 20.311 ms core9.fra.hetzner.com (213.239.224.170) 23.245 ms core8.fra.hetzner.com (213.239.224.153) 20.146 ms
7 core1.fra.hetzner.com (213.239.245.125) 20.769 ms 37.895 ms core5.fra.hetzner.com (213.239.224.218) 20.423 ms
8 * * *
9 ex9k2.dc13.fsn1.hetzner.com (213.239.224.6) 25.182 ms ex9k2.dc13.fsn1.hetzner.com (213.239.224.2) 25.085 ms ex9k2.dc13.fsn1.hetzner.com (213.239.224.6) 26.317 ms
10 fsn-node-01.torproject.org (88.198.8.185) 25.402 ms 25.348 ms 25.348 ms
I'll also note that the machines can be pinged from the virtualizer, so
I don't feel it's a routing issue on our end, but I could be mistaken:
root@fsn-node-01:~# ping -4 -c 3 alberti.torproject.org
PING alberti.torproject.org (49.12.57.132) 56(84) bytes of data.
64 bytes from alberti.torproject.org (49.12.57.132): icmp_seq=1 ttl=61 time=0.520 ms
64 bytes from alberti.torproject.org (49.12.57.132): icmp_seq=2 ttl=61 time=0.520 ms
64 bytes from alberti.torproject.org (49.12.57.132): icmp_seq=3 ttl=61 time=0.565 ms
--- alberti.torproject.org ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 10ms
rtt min/avg/max/mdev = 0.520/0.535/0.565/0.021 ms
Running a tcpdump on the targets doesn't reveal incoming packets, so
they don't seem to hit the network switch fsn-node-01 is attached to.
Thank you for any ideas
There are also email warnings from Bacula; it's unclear whether they are directly related to this.
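The traceroute in the ticket was read by eye to find where the path goes quiet. A rough sketch of the same check in Python follows, assuming the usual Linux `traceroute` text layout seen above. The `last_responding_hop` helper is hypothetical, and the sample is an abbreviated excerpt of the trace to alberti (hops 4 and 5 and the later silent hops are left out).

```python
import re
from typing import Optional, Tuple

# A hop line starts with the hop number followed by the probe results.
HOP = re.compile(r"^\s*(\d+)\s+(.*)$")


def last_responding_hop(output: str) -> Optional[Tuple[int, str]]:
    """Return (hop number, first host named) for the last hop that answered."""
    last = None
    for line in output.splitlines():
        match = HOP.match(line)
        if not match:
            continue  # header line or anything that is not a hop
        hop, rest = int(match.group(1)), match.group(2)
        # A hop where every probe timed out consists only of asterisks.
        tokens = [tok for tok in rest.split() if tok != "*"]
        if not tokens:
            continue
        last = (hop, tokens[0])
    return last


sample = """\
traceroute to alberti.torproject.org (49.12.57.132), 30 hops max, 60 byte packets
 1 172.31.1.1 (172.31.1.1) 11.665 ms 11.821 ms 11.746 ms
 2 15271.your-cloud.host (95.216.132.232) 0.396 ms 0.398 ms 0.751 ms
 3 * * *
 6 juniper1.dc11.fsn1.hetzner.com (213.239.245.166) 15.955 ms 15.781 ms 15.776 ms
 7 * * *
 8 * * *
"""

print(last_responding_hop(sample))
```

A silent hop does not by itself prove packets are dropped there: hop 3 is silent in both traces above, including the one that reaches fsn-node-01. The last responding hop is only a hint about where to start looking.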