we had two major outages in #40816 (closed): one was clearly due to a failed Open vSwitch upgrade, but the second is not so clear cut. we had a weird outage, complete with trash in /var/log/syslog, which seems to indicate a more serious hardware failure.
we also had a disk failure on that node (#40805 (closed)) which we had originally dismissed as a fluke (because a reboot fixed it), but now it seems we should more carefully evaluate that critical node.
for now, the Ganeti master has been failed over to fsn-node-02 and fsn-node-01 is marked as "drained" so it doesn't have any primary instances. this hurts the cluster "score" but we should be able to survive in this condition while we perform diagnostics.
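for reference, that failover and drain boil down to a couple of Ganeti commands; this is a minimal sketch from memory, the exact flags and hostnames may have differed:

```
# run on fsn-node-02, the new master candidate, to take over the master role
gnt-cluster master-failover

# mark fsn-node-01 as drained so the allocator stops placing new primaries on it
gnt-node modify --drained=yes fsn-node-01

# live-migrate any remaining primary instances over to their DRBD secondaries
gnt-node migrate fsn-node-01
```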
another possibility is that this machine has reached its end of life: it was set up on 2019-07-23 and maybe we should just order a new one to replace it.
there are three possible outcomes here:

1. we replace broken parts in fsn-node-01 (e.g. the disk)
2. we replace the entire server with a new machine
3. we conclude this was a fluke and put the node back into production as is
one more data point here. it's possible one of the VMs on fsn-node-01 caused the crash. i noticed something weird in the overview dashboard:
notice that sudden load drop? well that was neriniflorum. check this shit out, its load peaked at 250:
it looks like it got into a fork spiral:
... which of course affected CPU usage and memory:
more interestingly, neriniflorum is not the only VM that got into that state. looking at the VMs that were primary on fsn-node-01 at that point, we had eugeni, staticiforme, static-master-fsn, neriniflorum and tb-build-01. out of those, we saw similar symptoms on staticiforme and static-master-fsn, but not on tb-build-01 and eugeni, which completely blanked out: prometheus just wasn't able to scrape those targets at all. it looked like this on eugeni:
notice the spike in CPU usage before the outage, though. it's possible it suffered the same fate as the others, and just suffered so much more badly that it couldn't report anything.
so I don't know. this could be another case of the mysterious NVMe disk failure? who knows... the iowait is telling...
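for the record, the graphs above are just the standard node exporter metrics in prometheus; roughly these queries, a sketch where the prometheus URL is a placeholder and the instance label matching is a guess:

```
# fork rate on neriniflorum: a fork spiral shows up as a sustained spike here
promtool query instant http://prometheus.example.org:9090 \
  'rate(node_forks_total{instance=~"neriniflorum.*"}[5m])'

# 1-minute load average, which is what peaked at ~250
promtool query instant http://prometheus.example.org:9090 \
  'node_load1{instance=~"neriniflorum.*"}'

# iowait fraction per CPU, the "telling" part
promtool query instant http://prometheus.example.org:9090 \
  'rate(node_cpu_seconds_total{instance=~"neriniflorum.*",mode="iowait"}[5m])'
```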
also, for what it's worth, i recovered the smartmon and nvme exporters in puppet. this was somehow forgotten after the bullseye upgrade, which meant that we weren't collecting smartctl and NVMe stats in prometheus anymore. the smartmon textfile dashboard was broken too, but that's no great loss: that thing barely worked in the first place and obviously needs a lot of love. (it has, for example, no notion of NVMe metrics at all, and finds a lot of false positives on VMs)
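(to clarify what "textfile" means here: node_exporter scrapes *.prom files that a cron job drops into a directory. a rough sketch below; the script path and schedule are placeholders, not what the puppet module actually deploys:)

```
# sketch of a textfile collector cron job (script path and schedule are placeholders);
# node_exporter on Debian reads *.prom files from /var/lib/prometheus/node-exporter
# by default, and the write+rename keeps the file from being scraped half-written
*/15 * * * * root /usr/local/bin/smartmon.sh > /var/lib/prometheus/node-exporter/smartmon.prom.tmp && mv /var/lib/prometheus/node-exporter/smartmon.prom.tmp /var/lib/prometheus/node-exporter/smartmon.prom
```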
I think option 3 is out of the question. If we've had to drain fsn-node-01 then I don't think we can conclude things are just "normal". It sounds like replacing the disk might be a fix, and it's also cheaper than replacing the entire server. It might be worth trying the disk replacement first, and only building a new machine if the disk turns out not to be the issue.
that went off without a hitch, and i'm migrating colchicifolium.torproject.org as well to remove this warning from gnt-cluster verify (although i doubt it will fix it for fsn-node-04, because that's its secondary...)
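to double-check the layout and whether the warning cleared, roughly (a sketch):

```
# which instances have fsn-node-01 or fsn-node-04 as primary or secondary
gnt-instance list -o name,pnode,snodes | grep -E 'fsn-node-0[14]'

# re-run the cluster-wide consistency checks to see if the warning is gone
gnt-cluster verify
```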
So I looked at fsn-node-01 this morning and nothing seemed amiss. The nvme error-log command doesn't show anything abnormal reported by the disks themselves, which appear healthy. If the disk or storage subsystem on this computer was faulty, it's likely DRBD would notice because even as secondaries the disks are still being kept busy.
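The checks boil down to something like this (a sketch; device names vary per node):

```
# per-drive error log as reported by the NVMe controller itself
nvme error-log /dev/nvme0

# SMART/health summary: critical warnings, media errors, spare capacity, temperature
nvme smart-log /dev/nvme0

# smartctl gives a similar view for NVMe devices
smartctl -a /dev/nvme0
```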
I'd also mention that the ^@^@^@^@^@^@ garbage in /var/log/syslog is actually a bunch of NUL bytes (zeroes), which happens when a crash occurs while the logfile is being written to. It's not necessarily a telltale sign of hardware problems. So it's plausible the last crash was indeed caused by the VMs, possibly also worsened by the vswitch stuff going down?
Anyway, I agree with @kez we shouldn't just put the node back into production and pretend everything is OK, so I'd suggest we run memory diagnostics, and if everything looks OK, migrate one or two VMs back and wait a week.
While I was looking around I managed to trigger a new but harmless entry in the error log of /dev/nvme0, with the nvme get-log command and bogus parameters. So the latest warning sent via email can be ignored.
So I'd be ready to file a ticket with Hetzner and launch the memory diag using pcmemtest, but I'm not sure whether it would be OK for the DRBD secondaries to be offline for 24 hours or more while the test runs. We'd lose redundancy and the resync might take a long time when the node comes back up. So maybe I should move the secondaries off of fsn-node-01 beforehand?
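If we do move them first, the evacuation is roughly the following (a sketch; hail is the stock iallocator, the exact invocation may differ):

```
# re-create the DRBD secondaries of all affected instances on other nodes,
# using the hail iallocator to pick the targets
gnt-node evacuate -s -I hail fsn-node-01
```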
With the secondary DRBD volumes now in sync, I've un-drained fsn-node-01 and have started migrating static-master-fsn back to it. The plan is to let the node run that VM for a few days, and if that works, migrate another one.
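(For reference, a minimal sketch of that procedure, using the names above; exact flags may differ:)

```
# allow fsn-node-01 to hold primary instances again
gnt-node modify --drained=no fsn-node-01

# live-migrate static-master-fsn onto its (now in-sync) secondary, fsn-node-01
gnt-instance migrate static-master-fsn
```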
Seems fair. I would however point out that one of the NVMe drives did disappear in the past. It returned after a reboot, but that is a disturbing pattern, especially when combined with the other failures we had in this cluster related to this ticket.
Maybe that's not something to take care of directly in this ticket, but I think we should seriously consider retiring the first batch of gnt-fsn machines. We could order two new fsn-node* boxes, set them up, and retire fsn-node-01 and -02 afterwards. This could be done in our precious spare time. I estimate it would take about 2 days of work, and it would give us much more peace of mind not to have to deal with such hardware issues in the future. Hardware cost would be close to nil: ~150EUR setup fees per box, if i remember correctly.

And this is something we have to do at some point anyway, so let's make sure we have that scheduled before we close this ticket, even if the above strategy is successful in (slowly) bringing fsn-node-01 back into the cluster.
a.
> Seems fair. I would however point out that one of the NVMe drives did disappear in the past. It returned after a reboot, but that is a disturbing pattern, especially when combined with the other failures we had in this cluster related to this ticket.
Which of those other failures were established to be caused by hardware? As far as I know, none, so that's why I'm kind of on the fence about it.
> Maybe that's not something to take care of directly in this ticket, but I think we should seriously consider retiring the first batch of gnt-fsn machines.
I'd be curious to know what the reasoning is here, because I don't consider server hardware to be "old" after 3 years of use; that seems arbitrarily short? Don't get me wrong, I'm all in favor of getting rid of faulty hardware, but besides the thing with the NVMe disk, my understanding is that the other failures were caused by misbehaving software.
> Which of those other failures were established to be caused by hardware? As far as I know, none, so that's why I'm kind of on the fence about it.
a disk disappearing from the bus completely certainly feels like a hardware failure.

i also haven't ruled out a software or usage problem on the issue here, but i find it really odd that all those VMs would fail simultaneously, all on the same ganeti node. that feels hardware-related to me, because it's specific to that one node.
> I'd be curious to know what the reasoning is here, because I don't consider server hardware to be "old" after 3 years of use; that seems arbitrarily short? Don't get me wrong, I'm all in favor of getting rid of faulty hardware, but besides the thing with the NVMe disk, my understanding is that the other failures were caused by misbehaving software.
I don't know... we don't actually know what the physical age of this server is. we set it up in 2019, but it could very well have been set up long before that. maybe something worth investigating.
I looked into updating the firmware on the Samsung PM983 NVMe drives, but Samsung doesn't appear to offer firmware updates for them on its storage support page, only the tool to install the update itself, the "Samsung DC Toolkit".
I followed up with a phone call to Samsung support about upgrading the firmware (I couldn't find any email address, go figure) and the response was that since the serial number on the drive doesn't end with a letter, it's an OEM product that is not eligible for any kind of support or updates from Samsung. I have no idea who the OEM supplier is for this part, only Hetzner would know, so I don't think this is really worth exploring further.
No issues in the last few days, so I migrated materculae back to fsn-node-01: the node now has 2 running instances. Bonus points, the cluster now verifies without errors.