outage in gnt-dal cluster
it's not very clear what's happening right now but it looks like there's an ongoing outage in the ganeti dal cluster.
i'll try to document this the best i can, but for now i'll just open this issue.
pending issues:
- telegram-bot not reachable
- telegram-bot bacula crashing
- static-shim down
- tb-pkg-stage-01 down
- redis liveness on crm-int-01 from crm-ext-01
- henryi systemd degraded (unrelated)
- relay-01 NRPE socket timeout (unrelated)
the symptom, on affected nodes, is that /etc/network/interfaces was corrupt. on tb-pkgstage-01 the file was simply a short series of # (ASCII 0x23) and nothing else. it was also somehow marked as corrupt in the filesystem, and inaccessible until a reboot (which failed and dropped into the initramfs, forcing a fsck). the fsck found the file as dead and moved it to /lost+found, where it lies now.
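to check other nodes for the same symptom, something like the following could be used. this is only a minimal sketch: the helper names `is_hash_filled` and `check_file` are made up for illustration, and it assumes the corruption always looks like what was seen here (a file containing nothing but `#` bytes, or a file that cannot be read at all).

```python
from pathlib import Path

HASH_BYTE = 0x23  # ASCII '#'


def is_hash_filled(data: bytes) -> bool:
    """Return True if data is non-empty and made only of '#' bytes.

    Trailing newlines are ignored; any other byte (including a newline
    in the middle) makes the check fail.
    """
    stripped = data.rstrip(b"\r\n")
    return len(stripped) > 0 and all(b == HASH_BYTE for b in stripped)


def check_file(path: str) -> str:
    """Classify a file as 'missing', 'unreadable', 'hash-filled' or 'ok'."""
    try:
        data = Path(path).read_bytes()
    except FileNotFoundError:
        return "missing"
    except OSError:
        # e.g. an I/O error from a file the filesystem considers corrupt
        return "unreadable"
    return "hash-filled" if is_hash_filled(data) else "ok"


if __name__ == "__main__":
    # path taken from the symptom described above
    print(check_file("/etc/network/interfaces"))
```

anything other than `ok` on a node would be worth a closer look, although a clean result does not prove the node is unaffected.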
current theory is a memory corruption error, followup tasks:
- dal-node-03 memtest
- dal-node-02 evac
- dal-node-02 memtest
- dal-node-01 evac
- dal-node-01 memtest
- rebalance cluster
- netconsole setup to send kernel logs to dal-rescue-01
- DRBD verification test (drbdsetup verify and --verify-alg, @lavamind found issues on gnt-dal but also gnt-fsn)
- cluster-wide DRBD verification (the above, but for all instances, to be automated in #41225 (closed))
- enable data-integrity-alg on a subset of the VMs, at least one VM per node
- try to reproduce the issue again
- fix network configuration on the switch (filed as a separate ticket #41226 (closed), reported to quintex)
- move storage traffic to the intel NIC (eth1 -> eth3)
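
for the DRBD verification tasks, the out-of-sync counters could be collected with a small script once a verify run has finished. this is a rough sketch, not what was actually run: it assumes DRBD 8.x, where `/proc/drbd` exists and lists an `oos:` field (out-of-sync data, in KiB) per device. the `parse_oos` helper and the sample text are made up for illustration.

```python
import re
from typing import Dict

# "<minor>: cs:<state> ..." starts the block for one DRBD device
DEVICE_RE = re.compile(r"^\s*(\d+):\s+cs:(\S+)")
# "oos:<n>" is the amount of out-of-sync data, in KiB
OOS_RE = re.compile(r"\boos:(\d+)")

# illustrative sample only, not output captured from the cluster
SAMPLE = """\
version: 8.4.11 (api:1/proto:86-101)
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:128
"""


def parse_oos(proc_drbd_text: str) -> Dict[int, int]:
    """Map each DRBD minor number to its out-of-sync count in KiB."""
    result: Dict[int, int] = {}
    current = None
    for line in proc_drbd_text.splitlines():
        device = DEVICE_RE.match(line)
        if device:
            current = int(device.group(1))
            continue
        oos = OOS_RE.search(line)
        if oos and current is not None:
            result[current] = int(oos.group(1))
    return result


if __name__ == "__main__":
    try:
        with open("/proc/drbd") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the sample when DRBD is not loaded
    for minor, kib in sorted(parse_oos(text).items()):
        status = "OUT OF SYNC" if kib > 0 else "ok"
        print(f"minor {minor}: oos={kib} KiB {status}")
```

a non-zero `oos` after a verify means the two replicas disagree on some blocks. it does not say which side holds the good copy.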
Current status: mitigation removed after network reconfiguration, watching for signs of recurrence
/cc @lavamind