Failed disk on fsn-node-01

changed the description

assigned to @anarcat

added Doing label

i migrated all instances off fsn-node-01 except neriniflorum, because @lavamind is upgrading fsn-node-05, its current secondary. i have also done a master-failover to fsn-node-02 so, for the time being, cluster commands should be run there.

once fsn-node-05 is finished upgrading i'll finish the migration, mark fsn-node-01 offline, and file the upstream ticket for the disk replacement, which should take 2-4h.

last VM migrated, fsn-node-01 marked offline, filing upstream ticket.

update: upstream ticket is [Ticket#2022062103023487].

when the upstream maintenance is done, this needs to be done on fsn-node-02:

gnt-node add --readd fsn-node-01.torproject.org

we could also failover back to fsn-node-01 to avoid any future confusion.
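for the record, here's a rough sketch of the sequence so far as a script. the `run` wrapper and the `DRY_RUN` variable are made up for illustration (they only print what would be executed), and the `--offline=yes` spelling is from memory of `gnt-node modify`, so check the man page before trusting it. the master-failover is left as a comment because it has to be run on the node that becomes master, not from a script on the old one.

```sh
# Sketch only. DRY_RUN and run() are illustrative helpers, not part of Ganeti.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

drain_node() {
    # Move every instance whose primary is this node over to its secondary.
    run gnt-node migrate "$1"
    # Then mark the node offline so the cluster stops talking to it.
    run gnt-node modify --offline=yes "$1"
}

readd_node() {
    # To be run on the current master once the hardware is back.
    run gnt-node add --readd "$1"
}

# On the new master (fsn-node-02 here), before draining:
#   gnt-cluster master-failover
drain_node fsn-node-01.torproject.org
readd_node fsn-node-01.torproject.org
```

with `DRY_RUN=1` (the default here) this only echoes the three commands, which makes it safe to paste around.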

i rebooted the node and the drive unexpectedly returned:

root@fsn-node-01:~# lsblk --nodeps  -o +SERIAL
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT SERIAL
sda       8:0    0   9.1T  0 disk            5950A00SFWEF
sdb       8:16   0   9.1T  0 disk            5950A015FWEF
nvme1n1 259:0    0 894.3G  0 disk            S437NX0M402699
nvme0n1 259:1    0 894.3G  0 disk            S437NX0M402532

the drive didn't get re-added to the RAID arrays automatically, so i added it back by hand, and it resync'd fine:

root@fsn-node-01:~# cat /proc/mdstat 
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md125 : active (auto-read-only) raid1 sda1[0] sdb1[1]
      9766303744 blocks super 1.2 [2/2] [UU]
      bitmap: 0/73 pages [0KB], 65536KB chunk

md126 : active raid1 nvme1n1p3[1]
      936512512 blocks super 1.2 [2/1] [_U]
      bitmap: 7/7 pages [28KB], 65536KB chunk

md127 : active raid1 nvme1n1p2[1]
      523712 blocks super 1.2 [2/1] [_U]
      
unused devices: <none>
root@fsn-node-01:~# mdadm /dev/md127 --add /dev/nvme0n1p2 
mdadm: added /dev/nvme0n1p2
root@fsn-node-01:~# cat /proc/mdstat 
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md125 : active (auto-read-only) raid1 sda1[0] sdb1[1]
      9766303744 blocks super 1.2 [2/2] [UU]
      bitmap: 0/73 pages [0KB], 65536KB chunk

md126 : active raid1 nvme1n1p3[1]
      936512512 blocks super 1.2 [2/1] [_U]
      bitmap: 7/7 pages [28KB], 65536KB chunk

md127 : active raid1 nvme0n1p2[2] nvme1n1p2[1]
      523712 blocks super 1.2 [2/1] [_U]
      [===============>.....]  recovery = 76.6% (401792/523712) finish=0.0min speed=401792K/sec
      
unused devices: <none>
root@fsn-node-01:~# cat /proc/mdstat 
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md125 : active (auto-read-only) raid1 sda1[0] sdb1[1]
      9766303744 blocks super 1.2 [2/2] [UU]
      bitmap: 0/73 pages [0KB], 65536KB chunk

md126 : active raid1 nvme1n1p3[1]
      936512512 blocks super 1.2 [2/1] [_U]
      bitmap: 7/7 pages [28KB], 65536KB chunk

md127 : active raid1 nvme0n1p2[2] nvme1n1p2[1]
      523712 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>
root@fsn-node-01:~# mdadm /dev/md126 --add /dev/nvme0n1p3
mdadm: re-added /dev/nvme0n1p3
root@fsn-node-01:~# cat /proc/mdstat 
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md125 : active (auto-read-only) raid1 sda1[0] sdb1[1]
      9766303744 blocks super 1.2 [2/2] [UU]
      bitmap: 0/73 pages [0KB], 65536KB chunk

md126 : active raid1 nvme0n1p3[0] nvme1n1p3[1]
      936512512 blocks super 1.2 [2/1] [_U]
      [>....................]  recovery =  0.6% (6464064/936512512) finish=65.0min speed=238144K/sec
      bitmap: 7/7 pages [28KB], 65536KB chunk
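a small helper to spot arrays that are still missing a member, instead of eyeballing `/proc/mdstat` each time. this is a sketch: the function name `degraded_arrays` is made up, and it assumes the member status (`[UU]`, `[_U]`, ...) is the last field on the "blocks" line, as in the output above.

```sh
# Print the name of every md array whose status field contains a "_",
# i.e. at least one member is missing or still being rebuilt.
# Reads the file given as $1, defaulting to /proc/mdstat.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                      # remember the current array
        $NF ~ /^\[[U_]+\]$/ && $NF ~ /_/ { print name }  # e.g. [_U]
    ' "${1:-/proc/mdstat}"
}

degraded_arrays
```

on the first mdstat output above this would list md126 and md127. note that an array still shows `_` while it is recovering, so it only goes quiet once the resync is done.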

smart doesn't see any problems:

root@fsn-node-01:~# smartctl -H /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-15-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

started a smart test:

root@fsn-node-01:~# smartctl -t short /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-15-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
                    
NVMe device successfully opened

Use 'smartctl -a' (or '-x') to print SMART (and more) information

... but that doesn't actually seem to do anything. i tried to dig information out of nvme-cli as well, but nothing stands out:

root@fsn-node-01:~# smartctl -l error /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-15-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0          6     0  0x801a  0x4002  0x000            0     1     -
  1          5     0  0x2016  0x4002  0x000            0     1     -
  2          4     0  0x8019  0x4002  0x000            0     1     -
  3          3     0  0x5007  0x4002  0x000            0     -     -
root@fsn-node-01:~# nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 35 C
available_spare                         : 100%
available_spare_threshold               : 10%
percentage_used                         : 3%
endurance group critical warning summary: 0
data_units_read                         : 253906277
data_units_written                      : 444575851
host_read_commands                      : 13137721237
host_write_commands                     : 14468337066
controller_busy_time                    : 17950
power_cycles                            : 10
power_on_hours                          : 25713
unsafe_shutdowns                        : 6
media_errors                            : 0
num_err_log_entries                     : 6
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Temperature Sensor 1           : 35 C
Temperature Sensor 2           : 39 C
Temperature Sensor 3           : 42 C
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0
root@fsn-node-01:~# nvme intel smart-log-add /dev/nvme0
Additional Smart Log for NVME device:nvme0 namespace-id:ffffffff
key                               normalized raw
program_fail_count              : 100%       0
erase_fail_count                : 100%       0
wear_leveling                   :  96%       min: 157, max: 364, avg: 256
end_to_end_error_detection_count: 100%       0
crc_error_count                 : 100%       196606
timed_workload_media_wear       : 100%       0.003%
timed_workload_host_reads       : 100%       47%
timed_workload_timer            : 100%       1542818 min
thermal_throttle_status         : 100%       100%, cnt: 0
retry_buffer_overflow_count     :   0%       0
pll_lock_loss_count             :   0%       0
nand_bytes_written              : 100%       sectors: 7479175
host_bytes_written              : 100%       sectors: 6783690
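to make "nothing stands out" a bit less hand-wavy, here is a sketch of a filter over the `nvme smart-log` text output that only prints the fields i'd worry about. the function name `smart_log_flags` and the choice of fields (critical_warning, media_errors, available_spare against its threshold) are my own assumptions about what matters, not an official health check.

```sh
# Read `nvme smart-log` text output on stdin and print only suspicious fields.
# Prints nothing when everything checked looks fine.
smart_log_flags() {
    awk -F: '
        {
            key = $1; val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim the padded key
            gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim the value
            sub(/%$/, "", val)                 # "100%" -> "100"
        }
        key == "critical_warning" && val + 0 != 0 { print "critical_warning=" val }
        key == "media_errors"     && val + 0 != 0 { print "media_errors=" val }
        key == "available_spare"           { spare = val + 0; have_spare = 1 }
        key == "available_spare_threshold" { thresh = val + 0; have_thresh = 1 }
        END {
            if (have_spare && have_thresh && spare <= thresh)
                print "available_spare=" spare "% (threshold " thresh "%)"
        }
    '
}

# usage, assuming nvme-cli is installed:
#   nvme smart-log /dev/nvme0 | smart_log_flags
```

fed the smart-log output above, it prints nothing.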

the nvme error log is mostly full of empty entries like this:

 Entry[63]   
.................
error_count     : 0
sqid            : 0
cmdid           : 0
status_field    : 0(SUCCESS: The command completed successfully)
parm_err_loc    : 0
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................

... doesn't look like errors at all.. bah.

not sure what to do next here.

this just in:

From: root <root@fsn-node-01.torproject.org> (7 mins. ago) (rapports root tor unread)
Subject: SMART error (ErrorCount) detected on host: fsn-node-01
To: root@fsn-node-01.torproject.org
Date: Tue, 21 Jun 2022 15:46:14 +0000

This message was generated by the smartd daemon running on:

   host name:  fsn-node-01
   DNS domain: torproject.org

The following warning/error was logged by the smartd daemon:

Device: /dev/nvme0, number of Error Log entries increased from 2 to 6

Device info:
SAMSUNG MZQLB960HAJR-00007, S/N:S437NX0M402532, FW:EDA5202Q, 960 GB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another message will be sent in 24 hours if the problem persists.

... but those four new errors might just have come from the four commands I ran on the device:

root@fsn-node-01:~# nvme device-self-test /dev/nvme0
NVMe status: INVALID_OPCODE: The associated command opcode field is not valid(0x2001)
root@fsn-node-01:~# nvme device-self-test /dev/nvme0 -n 1 -s 1
NVMe status: INVALID_OPCODE: The associated command opcode field is not valid(0x2001)
root@fsn-node-01:~# nvme device-self-test /dev/nvme0 -n 1 -s2
NVMe status: INVALID_OPCODE: The associated command opcode field is not valid(0x2001)
root@fsn-node-01:~# nvme device-self-test /dev/nvme0 -n 1 -s 2
NVMe status: INVALID_OPCODE: The associated command opcode field is not valid(0x2001)

... which match the error log I get here:

root@fsn-node-01:~# nvme error-log /dev/nvme0
Error Log Entries for device:nvme0 entries:64
[...]
 Entry[ 3]   
.................
error_count     : 6
sqid            : 0
cmdid           : 0x801a
status_field    : 0x4002(INVALID_OPCODE: The associated command opcode field is not valid)
parm_err_loc    : 0
lba             : 0
nsid            : 0x1
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
[...]
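the numeric codes differ between the two outputs (0x2001 from nvme-cli, 0x4002 in the error log), but they look like the same status. assuming the error-log status_field keeps the phase tag in bit 0, which is how i read the NVMe spec, dropping that bit gives back the value the CLI printed. the helper name below is made up.

```sh
# Assumption: bit 0 of the error-log status_field is the phase tag,
# so the status code proper starts at bit 1.
errlog_status_to_cli() {
    printf '0x%x\n' "$(( $1 >> 1 ))"
}

errlog_status_to_cli 0x4002   # prints 0x2001
```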

after i told hetzner to hold, they actually closed the ticket. considering i haven't found any flaw with the NVMe controller and we have redundancy like hell in that cluster, I don't think we should worry about this any further.

i've flipped the master back to fsn-node-01 and put it back online. i'm migrating its VMs back as we speak, and I consider this incident closed.

closed

changed the incident status to Resolved by closing the incident

changed the severity to High - S2

mentioned in commit wiki-replica@fd104c15

mentioned in issue #40818 (closed)

marked this issue as related to #41168 (closed)

mentioned in issue #41168 (closed)
