outage in gnt-dal cluster

it's not very clear what's happening right now but it looks like there's an ongoing outage in the ganeti dal cluster.

i'll try to document this the best i can, but for now i'll just open this issue.

pending issues:

  • telegram-bot not reachable
  • telegram-bot bacula crashing
  • static-shim down
  • tb-pkg-stage-01 down
  • redis liveness on crm-int-01 from crm-ext-01
  • henryi systemd degraded (unrelated)
  • relay-01 NRPE socket timeout (unrelated)

the symptom, on affected nodes, is that /etc/network/interfaces was corrupt. on tb-pkgstage-01 the file simply a short series of # (ASCII 23) and nothing else. it was also marked as corrupt somehow in the filesystem, and inaccessible until a reboot (which failed and dropped in the initramfs, forcing a fsck). the fsck found the file as dead and moved it to /lost+found where it lies now.

current theory is a memory corruption error, followup tasks:

  • dal-node-03 memtest
  • dal-node-02 evac
  • dal-node-02 memtest
  • dal-node-01 evac
  • dal-node-01 memtest
  • rebalance cluster
  • netconsole setup to send kernel logs do dal-rescue-01
  • DRBD verification test (drbdsetup verify and --verify-alg, @lavamind found issues on gnt-dal but also gnt-fsn)
  • cluster-wide DRBD verification (the above, but for all instances, to be automated in #41225 (closed))
  • enable data-intergrity-alg on a subset of the VMs, at least one VM per node
  • try to reproduce the issue again
  • fix network configuration on the switch (filed as a separate ticket #41226 (closed), reported to quintex)
  • move storage traffic to the intel NIC (eth1 -> eth3)

Current status: mitigation removed after network reconfiguration, watching for signs of recurrence

/cc @lavamind

Edited Oct 02, 2023 by Jérôme Charaoui
Assignee Loading
Time tracking Loading