outage in gnt-dal cluster

it's not very clear what's happening right now but it looks like there's an ongoing outage in the ganeti dal cluster.

i'll try to document this the best i can, but for now i'll just open this issue.

pending issues:

telegram-bot not reachable
telegram-bot bacula crashing
static-shim down
tb-pkg-stage-01 down
redis liveness on crm-int-01 from crm-ext-01
~~henryi systemd degraded~~ (unrelated)
~~relay-01 NRPE socket timeout~~ (unrelated)

the symptom, on affected nodes, is that /etc/network/interfaces was corrupt. on tb-pkgstage-01 the file was simply a short series of # characters (ASCII 0x23) and nothing else. it was also marked as corrupt somehow in the filesystem, and inaccessible until a reboot (which failed and dropped into the initramfs, forcing a fsck). the fsck found the file as dead and moved it to /lost+found, where it lies now.
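a rough sketch of how that symptom could be checked for on other instances. this is not something that was run during the incident: the function names are made up for illustration, and the heuristic (file contains only # and whitespace) is just an assumption based on what was seen on the one affected file.

```python
from pathlib import Path

# bytes we tolerate in a "filler" file: '#' (0x23) plus plain whitespace
ALLOWED = set(b"# \t\r\n")


def is_hash_filler(data: bytes) -> bool:
    """Return True if data contains at least one '#' and nothing but '#' and whitespace."""
    return b"#" in data and set(data) <= ALLOWED


def check_file(path="/etc/network/interfaces"):
    """Return True/False for the filler heuristic, or None if the file cannot be read.

    An unreadable file is reported separately because the corrupt file
    was inaccessible until the reboot and fsck.
    """
    try:
        return is_hash_filler(Path(path).read_bytes())
    except OSError:
        return None


if __name__ == "__main__":
    print(check_file())
```

a file made only of empty comment lines would also match, so a True here is a hint to go look, not proof of corruption.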

current theory is a memory corruption error, followup tasks:

dal-node-03 memtest
dal-node-02 evac
dal-node-02 memtest
dal-node-01 evac
dal-node-01 memtest
rebalance cluster
netconsole setup to send kernel logs to dal-rescue-01
DRBD verification test (drbdsetup verify and --verify-alg, @lavamind found issues on gnt-dal but also gnt-fsn)
cluster-wide DRBD verification (the above, but for all instances, to be automated in #41225 (closed))
enable data-integrity-alg on a subset of the VMs, at least one VM per node
try to reproduce the issue again
fix network configuration on the switch (filed as a separate ticket #41226 (closed), reported to quintex)
move storage traffic to the intel NIC (eth1 -> eth3)
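for the DRBD verification items above, a minimal sketch of how the results of a verify run could be collected on a node. it assumes DRBD 8.4, where /proc/drbd lists one block per minor with an oos: (out of sync, in KiB) counter; the function name is hypothetical and this is not the automation tracked in #41225.

```python
import re


def out_of_sync_minors(proc_drbd_text: str) -> dict:
    """Map DRBD minor number -> out-of-sync KiB, for minors with oos > 0.

    Assumes the DRBD 8.4 /proc/drbd layout: a " N: cs:..." status line
    per minor, followed by a counters line ending in "oos:<KiB>".
    """
    result = {}
    minor = None
    for line in proc_drbd_text.splitlines():
        status = re.match(r"\s*(\d+):\s+cs:", line)
        if status:
            minor = int(status.group(1))
            continue
        counters = re.search(r"\boos:(\d+)", line)
        if counters and minor is not None:
            oos = int(counters.group(1))
            if oos > 0:
                result[minor] = oos
    return result


if __name__ == "__main__":
    with open("/proc/drbd") as f:
        for minor, kib in sorted(out_of_sync_minors(f.read()).items()):
            print(f"minor {minor}: {kib} KiB out of sync")
```

a non-zero count after a verify only says the two replicas differ, not which side is the good one, so it still needs a human to decide which way to resync.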

Current status: mitigation removed after network reconfiguration, watching for signs of recurrence

/cc @lavamind

Edited Oct 02, 2023 by Jérôme Charaoui
