Routing issue on Nagios server

Post-mortem

Timeline

  • 2021-09-30 ~19:00 UTC: Nagios notices something is wrong with the FSN cluster
  • 2021-10-01 ~18:00 UTC: anarcat notices the outage in Nagios, files this ticket
  • 2021-10-01 18:19 UTC: ticket 2021100103024963 filed at Hetzner
  • 2021-10-02 13:36 UTC: fsn-node-06 rebooted, no effect
  • 2021-10-02 14:28 UTC: bacula-director-01 migrated to fsn-node-03, backup outage partially resolved (for unaffected machines)
  • 2021-10-02 15:04 UTC: rouyi migrated to fsn-node-01, confirming the outage is node-specific
  • 2021-10-03 00:04 UTC: manual VLAN configuration attempt fails
  • 2021-10-04 12:47 UTC: Hetzner replies, asks to file another ticket
  • 2021-10-05 14:32 UTC: ticket 2021100403021166 filed with Hetzner, call with the Helsinki datacenter, no progress
  • 2021-10-05 14:42 UTC: Hetzner network team replies, asking for more information
  • 2021-10-05 14:54 UTC: another reply from the Hetzner network team, blaming firewall rules
  • 2021-10-05 15:16 UTC: network team finds a problem with vswitch 5391, suggests a reboot
  • 2021-10-05 15:33 UTC: Nagios can reach alberti again
  • 2021-10-05 17:10 UTC: static mirrors resynced
  • 2021-10-05 17:42 UTC: post-mortem written, incident closed

Root cause analysis

A failure in Hetzner's "vswitch" service caused inconsistent routing problems between the Nordic countries (at least Helsinki, Finland and Stockholm, Sweden) and our main point of presence (PoP) at Falkenstein (FSN1-DC13, specifically).

A ticket was filed with support@hetzner.com on Friday, but it should have been filed through the "Robot" interface instead.

What went right

  • most of the infrastructure kept working correctly, and the outage was limited to a fairly small number of machines and locations

Lessons learned

  • we need to file tickets through the Hetzner Robot interface, not support@hetzner.com
  • Hetzner's vswitch can fail in mysterious ways: if packets do not reach our interface, Hetzner needs to "restart the vswitch router"

Followup work

  • #40434 (closed): document that support@hetzner is crap

Overview

This is an overview of the outage, as of 2021-10-02 15:00 UTC (by @anarcat):

  • 32 instances in the gnt-fsn cluster are affected, on all nodes but fsn-node-03 and fsn-node-05, bizarrely
  • the phenomenon happens only when connecting from ipnett.se (Sweden) or Hetzner's Helsinki datacenters, and only when connecting to the above instances (and specifically only while they are on the affected nodes: if they are migrated to a "healthy" node, the problem eventually goes away)
  • backups are down on the instances, because bungei cannot connect to the Bacula file daemons on the instances
  • monitoring is down, i.e. Nagios cannot reach the instances
  • no IRC notifications: Nagios cannot reach the "nsa" bot either, and I actually can't figure out how that even works in the first place
  • no email notifications: Nagios cannot send email notifications because eugeni is down (might be worth letting Nagios send its own mail...)
  • since Nagios cannot reach the Puppet server either, it looks like NRPE is down on many more hosts, but it isn't: it's just pauli's NRPE that's unreachable
  • some instances (to be clarified) cannot run Puppet, as pauli is one of the affected nodes
  • the mirror network may be affected: Nagios is warning about "CRITICAL: 116.202.120.165 broken: 500 Can't connect to www.torproject.org:443 (Connection timed out), 1 mirror(s) not in sync (from oldest to newest): 116.202.120.165", but this could just be Nagios failing to run its check
  • we suspect a problem with openvswitch, but cannot trace a cause on our side: we followed the Hetzner documentation to configure a vswitch on fsn-node-07 (see the sketch after this list) and there are still routing problems, so the problem does not appear to be on our end
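
For reference, the manual VLAN configuration attempted on fsn-node-07 follows the general shape of Hetzner's vSwitch documentation. This is only a sketch: the NIC name, VLAN ID and address below are placeholders, not our actual values.

# create a VLAN sub-interface for the vSwitch on the node's public NIC
# (enp0s31f6 and VLAN 4000 are hypothetical)
ip link add link enp0s31f6 name vlan4000 type vlan id 4000
# Hetzner's vSwitch requires an MTU of 1400, per their documentation
ip link set dev vlan4000 mtu 1400
# assign an address on the (hypothetical) vSwitch subnet and bring the interface up
ip addr add 192.168.100.7/24 dev vlan4000
ip link set dev vlan4000 up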

So far, we haven't received any complaints from users about problems with the infrastructure, so this could just be a fluke inside Hetzner or limited to Sweden/Norway.

First ticket at hetzner

I filed this ticket with Hetzner:

Hi,

For about 23 hours now, our monitoring server has been seeing routing
issues to other machines in the Hetzner network. Affected machines
include:

 * alberti.torproject.org
 * eugeni.torproject.org
 * web-fsn-01.torproject.org

You will notice those machines actually respond to pings from
elsewhere. For example, I can ping them from my home in Canada (IP
206.248.172.91):

anarcat@angela:~(main)$ ping -4 -c 3 alberti.torproject.org
PING alberti.torproject.org (49.12.57.132) 56(84) bytes of data.
64 bytes from alberti.torproject.org (49.12.57.132): icmp_seq=1 ttl=48 time=104 ms
64 bytes from alberti.torproject.org (49.12.57.132): icmp_seq=2 ttl=48 time=106 ms
64 bytes from alberti.torproject.org (49.12.57.132): icmp_seq=3 ttl=48 time=107 ms

--- alberti.torproject.org ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 104.379/105.589/106.818/0.995 ms
"ping -c 3 alberti.torproject.org" took 5 mins

... but from our monitoring server in the Hetzner cloud, that machine is
unreachable:

root@hetzner-hel1-01:~# ping -c 3 alberti.torproject.org
PING alberti.torproject.org (49.12.57.132) 56(84) bytes of data.

--- alberti.torproject.org ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 17ms

The machine affected is hetzner-hel1-01.torproject.org. A traceroute
stops at a Juniper router inside your network, which makes me feel this
is a routing issue on Hetzner's end:

root@hetzner-hel1-01:~# traceroute alberti.torproject.org
traceroute to alberti.torproject.org (49.12.57.132), 30 hops max, 60 byte packets
 1  172.31.1.1 (172.31.1.1)  11.665 ms  11.821 ms  11.746 ms
 2  15271.your-cloud.host (95.216.132.232)  0.396 ms  0.398 ms  0.751 ms
 3  * * *
 4  static.88.198.252.117.clients.your-server.de (88.198.252.117)  1.508 ms  1.668 ms static.88.198.252.113.clients.your-server.de (88.198.252.113)  1.554 ms
 5  static.88-198-245-253.clients.your-server.de (88.198.245.253)  1.127 ms core32.hel1.hetzner.com (88.198.249.93)  0.582 ms core31.hel1.hetzner.com (88.198.249.89)  24.830 ms
 6  juniper1.dc11.fsn1.hetzner.com (213.239.245.166)  15.955 ms  15.781 ms  15.776 ms
 7  * * *
 8  * * *
 9  * * *
10  * * *
11  * * *
12  * * *
13  * * *
14  * * *
15  * * *
16  * * *
17  * * *
18  * * *
19  * * *
20  * * *
21  * * *
22  * * *
23  * * *
24  * * *
25  * * *
26  * * *
27  * * *
28  * * *
29  * * *
30  * * *

Interestingly, many of the machines affected are in our Ganeti cluster
hosted in Falkenstein, on the 7 machines named
fsn-node-01.torproject.org through fsn-node-07. Those machines,
surprisingly, can be pinged fine from the monitoring server:

root@hetzner-hel1-01:~# ping -c 3 fsn-node-01.torproject.org
PING fsn-node-01.torproject.org (88.198.8.185) 56(84) bytes of data.
64 bytes from fsn-node-01.torproject.org (88.198.8.185): icmp_seq=1 ttl=56 time=25.2 ms
64 bytes from fsn-node-01.torproject.org (88.198.8.185): icmp_seq=2 ttl=56 time=25.3 ms
64 bytes from fsn-node-01.torproject.org (88.198.8.185): icmp_seq=3 ttl=56 time=25.1 ms

--- fsn-node-01.torproject.org ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 5ms
rtt min/avg/max/mdev = 25.092/25.216/25.328/0.207 ms

And the route takes a slightly different path:

root@hetzner-hel1-01:~# traceroute fsn-node-01.torproject.org
traceroute to fsn-node-01.torproject.org (88.198.8.185), 30 hops max, 60 byte packets
 1  172.31.1.1 (172.31.1.1)  4.640 ms  4.402 ms  4.544 ms
 2  15271.your-cloud.host (95.216.132.232)  0.487 ms  0.598 ms  0.560 ms
 3  * * *
 4  static.88.198.252.117.clients.your-server.de (88.198.252.117)  0.988 ms  0.913 ms static.88.198.252.113.clients.your-server.de (88.198.252.113)  0.916 ms
 5  core31.hel1.hetzner.com (88.198.249.89)  0.762 ms  0.681 ms  0.621 ms
 6  core8.fra.hetzner.com (213.239.224.149)  20.311 ms core9.fra.hetzner.com (213.239.224.170)  23.245 ms core8.fra.hetzner.com (213.239.224.153)  20.146 ms
 7  core1.fra.hetzner.com (213.239.245.125)  20.769 ms  37.895 ms core5.fra.hetzner.com (213.239.224.218)  20.423 ms
 8  * * *
 9  ex9k2.dc13.fsn1.hetzner.com (213.239.224.6)  25.182 ms ex9k2.dc13.fsn1.hetzner.com (213.239.224.2)  25.085 ms ex9k2.dc13.fsn1.hetzner.com (213.239.224.6)  26.317 ms
10  fsn-node-01.torproject.org (88.198.8.185)  25.402 ms  25.348 ms  25.348 ms

I'll also note that the machines can be pinged from the virtualizer, so
I don't feel it's a routing issue on our end, but I could be mistaken:

root@fsn-node-01:~# ping -4  -c 3 alberti.torproject.org
PING alberti.torproject.org (49.12.57.132) 56(84) bytes of data.
64 bytes from alberti.torproject.org (49.12.57.132): icmp_seq=1 ttl=61 time=0.520 ms
64 bytes from alberti.torproject.org (49.12.57.132): icmp_seq=2 ttl=61 time=0.520 ms
64 bytes from alberti.torproject.org (49.12.57.132): icmp_seq=3 ttl=61 time=0.565 ms

--- alberti.torproject.org ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 10ms
rtt min/avg/max/mdev = 0.520/0.535/0.565/0.021 ms

Running a tcpdump on the targets doesn't reveal any incoming packets, so
the packets don't even seem to reach the network switch fsn-node-01 is
attached to.

Thank you for any ideas 
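
The tcpdump check mentioned in the ticket would look something like the
following, run on one of the target machines while the monitoring server
pings it; the interface name and source address are placeholders, not the
actual values:

# capture ICMP echo requests arriving from the monitoring server on the
# target's public interface; during the outage nothing showed up here
# while the pings were running
tcpdump -n -i eth0 'icmp and src host 95.216.0.1'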

There are also email warnings from Bacula; it's unclear whether they are directly related to this.
