it's not very clear what's happening right now, but it looks like there's an ongoing outage in the ganeti dal cluster.
i'll try to document this the best i can, but for now i'll just open this issue.
pending issues:
telegram-bot not reachable
telegram-bot bacula crashing
static-shim down
tb-pkg-stage-01 down
redis liveness on crm-int-01 from crm-ext-01
henryi systemd degraded (unrelated)
relay-01 NRPE socket timeout (unrelated)
the symptom, on affected nodes, is that /etc/network/interfaces was corrupt. on tb-pkgstage-01 the file was simply a short series of # characters (ASCII 0x23) and nothing else. it was also somehow marked as corrupt in the filesystem, and inaccessible until a reboot (which failed and dropped into the initramfs, forcing a fsck). the fsck found the file to be damaged and moved it to /lost+found, where it lies now.
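to check whether other nodes or instances are silently carrying the same corruption, a quick sweep of /etc/network/interfaces is enough. this is just a rough sketch: the host list is hypothetical and it assumes root ssh access:

# dump the first bytes of the file on each host; a healthy file starts with readable "auto"/"iface" lines
for h in telegram-bot-01 static-gitlab-shim tb-pkgstage-01; do echo "== $h"; ssh root@$h.torproject.org 'od -c /etc/network/interfaces | head -5'; done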
current theory is a memory corruption error. follow-up tasks:
dal-node-03 memtest
dal-node-02 evac
dal-node-02 memtest
dal-node-01 evac
dal-node-01 memtest
rebalance cluster
netconsole setup to send kernel logs to dal-rescue-01
DRBD verification test (drbdsetup verify and --verify-alg; @lavamind found issues on gnt-dal but also gnt-fsn); see the rough sketch after this list
cluster-wide DRBD verification (the above, but for all instances, to be automated in #41225 (closed))
enable data-integrity-alg on a subset of the VMs, at least one VM per node
try to reproduce the issue again
fix network configuration on the switch (filed as a separate ticket #41226 (closed), reported to quintex)
move storage traffic to the intel NIC (eth1 -> eth3)
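for the DRBD and netconsole items above, here's a rough sketch of what this could look like on one node. the resource name, addresses and MAC are made up, and since ganeti drives its DRBD devices through drbdsetup directly (no /etc/drbd.d resources), the drbdadm form below is only the generic illustration:

# 1. online verification: set verify-alg (and optionally data-integrity-alg) in the resource's
#    net section, e.g. net { verify-alg sha1; data-integrity-alg sha1; }, then:
drbdadm adjust r0      # r0 is a hypothetical resource name
drbdadm verify r0      # progress is visible in /proc/drbd
grep -i 'out of sync' /var/log/kern.log   # out-of-sync blocks are reported in the kernel log

# 2. netconsole: stream kernel messages to dal-rescue-01 (addresses and MAC are made up)
modprobe netconsole netconsole=6665@10.0.0.11/eth0,6666@10.0.0.250/aa:bb:cc:dd:ee:ff
# on dal-rescue-01, something like "nc -u -l -p 6666 | tee kernel.log" will capture them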
Current status: mitigation removed after network reconfiguration, watching for signs of recurrence
It seems like what happened is that yesterday @kez did routine reboots for security upgrades, and some machines on gnt-dal, though they rebooted successfully, lost their network access because /etc/network/interfaces somehow got corrupted.
Restoring this file and rebooting fixed the problem on telegram-bot-01 and static-gitlab-shim.
On static-gitlab-shim, I kept the corrupted /etc/network/interfaces file around as /etc/network/interfaces.bak. It contains only the characters 77 (no line break).
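For reference, "restoring" here just means putting a plain ifupdown config back in place. A minimal sketch of what these files normally contain (the addresses below are made up; the real values come from the host's existing records):

# /etc/network/interfaces (hypothetical addresses)
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
    address 203.0.113.10/24
    gateway 203.0.113.1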
we're taking a short break on this to think about the root cause. we haven't figured anything out yet, but i made a timeline. it seems unrelated to yesterday's datacenter intervention to deploy dal-rescue-01 (#41135 (closed)), and more related to the ganeti node reboots.
@kez, did anything out of the ordinary happen in the gnt-dal cluster reboot?
when i rebooted dal-node-01, the reboot script timed out and i had to manually gnt-instance start all the instances. from the bash history:
for h in dangerzone-01 forum-test-01 gitlab-dev-01 metrics-psqlts-01 onionbalance-02 static-gitlab-shim tb-tester-01 telegram-bot-01 web-dal-07; do gnt-instance start $h.torproject.org; done
they all seemed happy when i left, but i wonder if the reboot was what corrupted those files
The contents of the faulty /etc/network/interfaces on tb-pkgstage-01 look like syslog/journald messages being written in the wrong place on the filesystem.
on tb-pkgstage-01 there was also a corrupted /etc/network/interfaces.d/eth0. it was moved to /lost+found/#377 by fsck (where I renamed it again to keep track of its original name). here are the contents:
Memory test started. It took a while because I mistakenly thought booting the GRML live image would give me the usual menu prompt to go to Memtest but actually we only get that when booting the actual ISO.
So I installed the memtest86+ package and booted into that, but it kept hanging until I realized I should try with SMP off, and now it's finally memtesting!
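For future reference, the working recipe boils down to installing memtest86+ on the node itself instead of relying on the live image's menu; a rough sketch on Debian with GRUB:

apt install memtest86+   # ships a /etc/grub.d snippet that adds a memory test menu entry
update-grub              # regenerate the GRUB menu so the entry shows up
reboot                   # then pick the memtest86+ entry at the boot menu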
also, about that particular issue: it seems like it was unrelated. the problem was that the ipsec tunnel between crm-int-01 and crm-ext-01 was not correctly set up, which (naturally) broke the redis backend.
it's not very clear why it happened, but when i tried to manually bring up the tunnel, it failed with some sort of "permission denied" error. i had a hunch this was due to some kernel module not loading, so i rebooted both boxes and everything came back normally.
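for posterity, a rough sketch of the checks that would have confirmed (or ruled out) the missing-kernel-module hunch; this assumes a strongSwan-style setup and a made-up connection name, so adjust to the actual stack:

lsmod | grep -E 'xfrm|esp|af_key'   # are the IPsec-related kernel modules loaded?
ipsec statusall                     # strongSwan: list configured and established tunnels
ipsec up crm-tunnel                 # hypothetical connection name; this is the step that failed with "permission denied"
dmesg | tail -50                    # look for xfrm/module errors around the time of the failure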
there was an outage at Hetzner's vSwitch setup earlier and i suspect this was the cause, possibly combined with crm-int-01's reboot. in any case, it's solved now, and shouldn't be considered part of this incident other than that it was noticed at the same time...
this outage lasted:
Date: Tue, 16 May 2023 20:03:17 +0000
Date: Wed, 17 May 2023 15:24:47 +0000
i opened #41178 (closed) about the hetzner outage, and after making the timeline i noticed that the hetzner outage happened after the outage here started: it was between 2023-05-17 12:18UTC and 12:41UTC, while the problem here started a full 16 hours earlier and ended about 3 hours later.
so the crm-int outage does not seem related to the hetzner issue at all.