we had two major outages in #40816 (closed): one was clearly due to a failed Open vSwitch upgrade, but the second is not so clear cut. we had a weird outage, complete with trash in /var/log/syslog, which seems to indicate a more serious hardware failure.
we also had a disk failure on that node (#40805 (closed)) which we had originally dismissed as a fluke (because a reboot fixed it), but now it seems we should more carefully evaluate that critical node.
for now, the Ganeti master has been failed over to fsn-node-02 and fsn-node-01 is marked as "drained" so it doesn't have any primary instances. this hurts the cluster "score" but we should be able to survive in this condition while we perform diagnostics.
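for reference, that failover and drain boil down to a couple of Ganeti commands; this is a minimal sketch from memory, the exact flags and hostnames may have differed:

```
# run on fsn-node-02, the new master candidate, to take over the master role
gnt-cluster master-failover

# mark fsn-node-01 as drained so the allocator stops placing new primaries on it
gnt-node modify --drained=yes fsn-node-01

# live-migrate any remaining primary instances over to their DRBD secondaries
gnt-node migrate fsn-node-01
```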
another possibility is that this machine has reached its end of life: it was set up on 2019-07-23 and maybe we should just order a new one to replace it.
there are three possible outcomes here:

1. we replace broken parts in fsn-node-01 (e.g. the disk)
2. we replace the entire server with a new machine
3. we conclude this was a fluke and put the node back into production as is
one more data point here. it's possible one of the VMs on fsn-node-01 caused the crash. i noticed something weird in the overview dashboard:
notice that sudden load drop? well that was neriniflorum. check this shit out, its load peaked at 250:
it looks like it got into a fork spiral:
... which of course affected CPU usage and memory:
more interestingly, neriniflorum is not the only VM that got into that state. looking at the VMs that were primary on fsn-node-01 at that point, we had eugeni, staticiforme, static-master-fsn, neriniflorum and tb-build-01. out of those, we saw similar symptoms on staticiforme and static-master-fsn, but not on tb-build-01 and eugeni, which completely blanked out: prometheus just wasn't able to scrape those targets at all. it looked like this on eugeni:
notice the spike in CPU usage before the outage, though. it's possible it suffered the same fate as the others, and just suffered so much more badly that it couldn't report anything.
so I don't know. this could be another case of the mysterious NVMe disk failure? who knows... the iowait is telling...
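for the record, the graphs above are just the standard node exporter metrics in prometheus; roughly these queries, a sketch where the prometheus URL is a placeholder and the instance label matching is a guess:

```
# fork rate on neriniflorum: a fork spiral shows up as a sustained spike here
promtool query instant http://prometheus.example.org:9090 \
  'rate(node_forks_total{instance=~"neriniflorum.*"}[5m])'

# 1-minute load average, which is what peaked at ~250
promtool query instant http://prometheus.example.org:9090 \
  'node_load1{instance=~"neriniflorum.*"}'

# iowait fraction per CPU, the "telling" part
promtool query instant http://prometheus.example.org:9090 \
  'rate(node_cpu_seconds_total{instance=~"neriniflorum.*",mode="iowait"}[5m])'
```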
also, for what it's worth, i recovered the smartmon and nvme exporters in puppet. this was somehow forgotten after the bullseye upgrade, which meant that we weren't collecting smartctl and NVMe stats in prometheus anymore. the smartmon textfile dashboard was broken too, but that's no great loss: that thing barely worked in the first place and obviously needs a lot of love. (it has, for example, no notion of NVMe metrics at all, and finds a lot of false positives on VMs)
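(to clarify what "textfile" means here: node_exporter scrapes *.prom files that a cron job drops into a directory. a rough sketch below; the script path and schedule are placeholders, not what the puppet module actually deploys:)

```
# sketch of a textfile collector cron job (script path and schedule are placeholders);
# node_exporter on Debian reads *.prom files from /var/lib/prometheus/node-exporter
# by default, and the write+rename keeps the file from being scraped half-written
*/15 * * * * root /usr/local/bin/smartmon.sh > /var/lib/prometheus/node-exporter/smartmon.prom.tmp && mv /var/lib/prometheus/node-exporter/smartmon.prom.tmp /var/lib/prometheus/node-exporter/smartmon.prom
```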
I think option 3 is out of the question. If we've had to drain fsn-node-01 then I don't think we can conclude things are just "normal". It sounds like replacing the disk might be a fix, and it's also cheaper than replacing the entire server. It might be worth trying the disk replacement first, and only building a new machine if the disk turns out not to be the issue.
that went off without a hitch, and i'm migrating colchicifolium.torproject.org as well to remove this warning from gnt-cluster verify (although i doubt it will fix it for fsn-node-04, because that's its secondary...)
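to double-check the layout and whether the warning cleared, roughly (a sketch):

```
# which instances have fsn-node-01 or fsn-node-04 as primary or secondary
gnt-instance list -o name,pnode,snodes | grep -E 'fsn-node-0[14]'

# re-run the cluster-wide consistency checks to see if the warning is gone
gnt-cluster verify
```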
So I looked at fsn-node-01 this morning and nothing seemed amiss. The nvme error-log command doesn't show anything abnormal reported by the disks themselves, which appear healthy. If the disk or storage subsystem on this computer was faulty, it's likely DRBD would notice because even as secondaries the disks are still being kept busy.
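The checks boil down to something like this (a sketch; device names vary per node):

```
# per-drive error log as reported by the NVMe controller itself
nvme error-log /dev/nvme0

# SMART/health summary: critical warnings, media errors, spare capacity, temperature
nvme smart-log /dev/nvme0

# smartctl gives a similar view for NVMe devices
smartctl -a /dev/nvme0
```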
I'd also mention that the ^@^@^@^@^@^@ garbage in /var/log/syslog is actually a bunch of NUL bytes (zeroes), which happens when a crash occurs while the logfile is being written to. It's not necessarily a telltale sign of hardware problems. So it's plausible the last crash was indeed caused by the VMs, possibly also worsened by the vswitch stuff going down?
Anyway, I agree with @kez we shouldn't just put the node back into production and pretend everything is OK, so I'd suggest we run memory diagnostics, and if everything looks OK, migrate one or two VMs back and wait a week.
While I was looking around I managed to trigger a new but harmless entry in the error log of /dev/nvme0, with the nvme get-log command and bogus parameters. So the latest warning sent via email can be ignored.
So I'd be ready to file a ticket with Hetzner and launch the memory diag using pcmemtest, but I'm not sure whether it would be OK for the DRBD secondaries to be offline for 24 hours or more while the test runs. We'd lose redundancy and the resync might take a long time when the node comes back up. So maybe I should move the secondaries off of fsn-node-01 beforehand?
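If we do move them first, the evacuation is roughly the following (a sketch; hail is the stock iallocator, the exact invocation may differ):

```
# re-create the DRBD secondaries of all affected instances on other nodes,
# using the hail iallocator to pick the targets
gnt-node evacuate -s -I hail fsn-node-01
```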
With the secondary DRBD volumes now in sync, I've un-drained fsn-node-01 and have started migrating static-master-fsn back to it. The plan is to let the node run that VM for a few days, and if that works, migrate another one.
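(For reference, a minimal sketch of that procedure, using the names above; exact flags may differ:)

```
# allow fsn-node-01 to hold primary instances again
gnt-node modify --drained=no fsn-node-01

# live-migrate static-master-fsn onto its (now in-sync) secondary, fsn-node-01
gnt-instance migrate static-master-fsn
```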
Seems fair. I would however point out that one of the NVMe drives did disappear in the past. It returned after a reboot, but that is a disturbing pattern, especially when combined with the other failures we had in this cluster related to this ticket.
Maybe that's not something to take care of directly in this ticket, but I think we should seriously consider retiring the first batch of gnt-fsn machines. We could order two new fsn-node* boxes, set them up, and retire fsn-node-01 and -02 afterwards. This could be done in our precious spare time. I estimate it would take about 2 days of work, and it would give us much more peace of mind not to have to deal with such hardware issues in the future. Hardware cost would be close to nil: ~150EUR setup fees per box, if i remember correctly.

And this is something we have to do at some point anyway, so let's make sure we have that scheduled before we close this ticket, even if the above strategy is successful in (slowly) bringing fsn-node-01 back into the cluster.
a.
> Seems fair. I would however point out that one of the NVMe drives did disappear in the past. It returned after a reboot, but that is a disturbing pattern, especially when combined with the other failures we had in this cluster related to this ticket.
Which of those other failures were established to be caused by hardware? As far as I know, none, so that's why I'm kind of on the fence about it.
> Maybe that's not something to take care of directly in this ticket, but I think we should seriously consider retiring the first batch of gnt-fsn machines.
I'd be curious to know what the reasoning is here, because I don't consider server hardware to be "old" after 3 years of use; that seems arbitrarily short? Don't get me wrong, I'm all in favor of getting rid of faulty hardware, but besides the thing with the NVMe disk, my understanding is that the other failures were caused by misbehaving software.
> Which of those other failures were established to be caused by hardware? As far as I know, none, so that's why I'm kind of on the fence about it.
a disk disappearing from the bus completely certainly feels like a hardware failure.

i also haven't ruled out a software or usage problem on the issue here, but i find it really odd that all those VMs would fail simultaneously, all on the same ganeti node. that feels hardware-related to me, because it's specific to that one node.
> I'd be curious to know what the reasoning is here, because I don't consider server hardware to be "old" after 3 years of use; that seems arbitrarily short? Don't get me wrong, I'm all in favor of getting rid of faulty hardware, but besides the thing with the NVMe disk, my understanding is that the other failures were caused by misbehaving software.
I don't know... we don't actually know what the physical age of this server is. we set it up in 2019, but it could very well have been set up long before that. maybe something worth investigating.
I looked into updating the firmware on the Samsung PM983 NVMe drives, but Samsung doesn't appear to offer firmware updates for them on its storage support page, only the tool to install the update itself, the "Samsung DC Toolkit".
I followed up with a phone call to Samsung support about upgrading the firmware (I couldn't find any email address, go figure) and the response was that since the serial number on the drive doesn't end with a letter, it's an OEM product that is not eligible for any kind of support or updates from Samsung. I have no idea who the OEM supplier is for this part, only Hetzner would know, so I don't think this is really worth exploring further.
No issues in the last few days, so I migrated materculae back to fsn-node-01: the node now has 2 running instances. Bonus points, the cluster now verifies without errors.