i migrated all instances off fsn-node-01 except neriniflorum, because @lavamind is upgrading fsn-node-05, its current secondary. i have also done a master-failover to fsn-node-02 so, for the time being, cluster commands should be run there.
once fsn-node-05 is finished upgrading i'll finish the migration, mark fsn-node-01 offline, and file the upstream ticket for the disk replacement, which should take 2-4h.
the nvme error log is mostly full of stuff like this:
 Entry[63]
.................
error_count : 0
sqid : 0
cmdid : 0
status_field : 0(SUCCESS: The command completed successfully)
parm_err_loc : 0
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
From: root <root@fsn-node-01.torproject.org> (7 mins. ago) (rapports root tor unread)
Subject: SMART error (ErrorCount) detected on host: fsn-node-01
To: root@fsn-node-01.torproject.org
Date: Tue, 21 Jun 2022 15:46:14 +0000

This message was generated by the smartd daemon running on:

   host name:  fsn-node-01
   DNS domain: torproject.org

The following warning/error was logged by the smartd daemon:

Device: /dev/nvme0, number of Error Log entries increased from 2 to 6

Device info:
SAMSUNG MZQLB960HAJR-00007, S/N:S437NX0M402532, FW:EDA5202Q, 960 GB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another message will be sent in 24 hours if the problem persists.
... but those errors might have been caused by these four commands I ran on the device (the count went from 2 to 6, i.e. four new entries):
root@fsn-node-01:~# nvme device-self-test /dev/nvme0
NVMe status: INVALID_OPCODE: The associated command opcode field is not valid(0x2001)
root@fsn-node-01:~# nvme device-self-test /dev/nvme0 -n 1 -s 1
NVMe status: INVALID_OPCODE: The associated command opcode field is not valid(0x2001)
root@fsn-node-01:~# nvme device-self-test /dev/nvme0 -n 1 -s2
NVMe status: INVALID_OPCODE: The associated command opcode field is not valid(0x2001)
root@fsn-node-01:~# nvme device-self-test /dev/nvme0 -n 1 -s 2
NVMe status: INVALID_OPCODE: The associated command opcode field is not valid(0x2001)
... which match the error log I get here:
root@fsn-node-01:~# nvme error-log /dev/nvme0
Error Log Entries for device:nvme0 entries:64
[...]
 Entry[ 3]
.................
error_count : 6
sqid : 0
cmdid : 0x801a
status_field : 0x4002(INVALID_OPCODE: The associated command opcode field is not valid)
parm_err_loc : 0
lba : 0
nsid : 0x1
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
[...]
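since most of the 64 slots are empty "SUCCESS" entries, it helps to filter them out before reading the log. here is a minimal sketch of that, assuming the usual multi-line `key : value` layout of `nvme error-log` output. `parse_error_log` and `real_errors` are hypothetical helper names (not part of nvme-cli), and the sample text is an abbreviated copy of the two entries quoted in this ticket.

```python
import re

# Abbreviated sample, based on the two entries quoted in this ticket.
SAMPLE = """\
Error Log Entries for device:nvme0 entries:64
.................
 Entry[ 3]
.................
error_count     : 6
sqid            : 0
cmdid           : 0x801a
status_field    : 0x4002(INVALID_OPCODE: The associated command opcode field is not valid)
nsid            : 0x1
.................
 Entry[63]
.................
error_count     : 0
sqid            : 0
cmdid           : 0
status_field    : 0(SUCCESS: The command completed successfully)
nsid            : 0
.................
"""

ENTRY_RE = re.compile(r"^\s*Entry\[\s*(\d+)\]")   # e.g. " Entry[ 3]"
FIELD_RE = re.compile(r"^(\w+)\s*:\s*(.*)$")      # e.g. "error_count : 6"


def parse_error_log(text):
    """Split the log text into one dict per Entry[N] block."""
    entries = []
    current = None
    for line in text.splitlines():
        m = ENTRY_RE.match(line)
        if m:
            current = {"entry": int(m.group(1))}
            entries.append(current)
            continue
        m = FIELD_RE.match(line)
        if m and current is not None:
            current[m.group(1)] = m.group(2).strip()
    return entries


def real_errors(entries):
    """Keep only entries whose error_count is non-zero (empty slots are 0)."""
    return [e for e in entries if int(e.get("error_count", "0"), 0) != 0]


if __name__ == "__main__":
    for e in real_errors(parse_error_log(SAMPLE)):
        print(e["entry"], e["error_count"], e["status_field"])
```

to use it on a live host, the output of `nvme error-log /dev/nvme0` could be saved to a file and passed to `parse_error_log` instead of `SAMPLE`. on this device that should leave only the INVALID_OPCODE entries, which is what makes the link to the self-test commands plausible.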
after i told hetzner to hold, they actually closed the ticket. considering i haven't found any flaw with the NVMe controller and we have redundancy like hell in that cluster, I don't think we should worry about this any further.
i've flipped the master back to fsn-node-01 and put it back online. i'm migrating its VMs back as we speak, and I consider this incident closed.