outage in gnt-dal cluster
it's not very clear what's happening right now but it looks like there's an ongoing outage in the ganeti dal cluster.
i'll try to document this the best i can, but for now i'll just open this issue.
pending issues:
- telegram-bot not reachable
- telegram-bot bacula crashing
- static-shim down
- tb-pkg-stage-01 down
- redis liveness on crm-int-01 from crm-ext-01
- henryi systemd degraded (unrelated)
- relay-01 NRPE socket timeout (unrelated)
the symptom, on affected nodes, is that /etc/network/interfaces was corrupt. on tb-pkgstage-01, the file was simply a short series of # characters (ASCII 0x23) and nothing else. it was also marked as corrupt somehow in the filesystem, and inaccessible until a reboot (which failed and dropped into the initramfs, forcing a fsck). the fsck found the file as dead and moved it to /lost+found, where it lies now.
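
a minimal sketch of how other nodes could be checked for the same symptom, assuming the corruption always looks like it did here (a file containing nothing but # bytes). the function names and the choice to ignore whitespace are assumptions for illustration, not something that was run during the incident.

```python
from pathlib import Path


def looks_like_hash_corruption(data: bytes) -> bool:
    """Return True if data is non-empty and made only of '#' bytes (0x23).

    Whitespace (for example a trailing newline) is ignored. That leniency
    is an assumption: the file seen in the incident had nothing but '#'.
    """
    # Drop whitespace bytes so a trailing newline does not hide the pattern.
    stripped = bytes(b for b in data if b not in b" \t\r\n")
    return len(stripped) > 0 and set(stripped) == {0x23}


def check_file(path: str = "/etc/network/interfaces") -> bool:
    """Read path and report whether it matches the corruption pattern."""
    return looks_like_hash_corruption(Path(path).read_bytes())
```

a normal interfaces file, even one that is mostly comments, contains other characters and is not flagged. a file that cannot be read at all (as happened before the fsck) raises OSError instead, which is worth treating as suspicious too.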
current theory is a memory corruption error, followup tasks:
- dal-node-03 memtest
- dal-node-02 evac
- dal-node-02 memtest
- dal-node-01 evac
- dal-node-01 memtest
- rebalance cluster
- netconsole setup to send kernel logs to dal-rescue-01
- DRBD verification test (drbdsetup verify and --verify-alg, @lavamind found issues on gnt-dal but also gnt-fsn)
- cluster-wide DRBD verification (the above, but for all instances, to be automated in #41225 (closed))
- enable data-integrity-alg on a subset of the VMs, at least one VM per node
- try to reproduce the issue again
- fix network configuration on the switch (filed as a separate ticket #41226 (closed), reported to quintex)
- move storage traffic to the intel NIC (eth1 -> eth3)
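
for the DRBD verification tasks above, a rough sketch of how the result of an online verify could be read back. it assumes a DRBD 8.4-style /proc/drbd, where each device has an oos: (out-of-sync, in KiB) counter; which DRBD version these nodes run is not stated here, and DRBD 9 reports status differently. the function name is hypothetical.

```python
import re
from typing import Dict, Optional

# Device header line, e.g. " 0: cs:Connected ro:Primary/Secondary ..."
_DEVICE_RE = re.compile(r"^\s*(\d+):\s+cs:(\S+)")
# Out-of-sync counter (KiB) on the statistics line that follows the header
_OOS_RE = re.compile(r"\boos:(\d+)")


def out_of_sync_minors(proc_drbd_text: str) -> Dict[int, int]:
    """Map DRBD minor number -> out-of-sync KiB, only for minors with oos > 0."""
    result: Dict[int, int] = {}
    current_minor: Optional[int] = None
    for line in proc_drbd_text.splitlines():
        device = _DEVICE_RE.match(line)
        if device:
            # Remember which minor the following statistics lines belong to.
            current_minor = int(device.group(1))
            continue
        oos = _OOS_RE.search(line)
        if oos and current_minor is not None:
            kib = int(oos.group(1))
            if kib > 0:
                result[current_minor] = kib
    return result
```

on a node this would be fed the contents of /proc/drbd after the verify has finished. an empty result means no device reported out-of-sync blocks, and any entry is a minor worth investigating.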
Current status: mitigation removed after network reconfiguration, watching for signs of recurrence
/cc @lavamind