silence or resolve SMART warnings from fsn-node-08 and dal-rescue-01
Since July 17th 2023, we've been getting daily warnings like this one:
Date: Mon, 17 Jul 2023 18:05:45 +0000
From: root <root@fsn-node-08.torproject.org>
To: root@fsn-node-08.torproject.org
Subject: SMART error (CurrentPendingSector) detected on host: fsn-node-08
This message was generated by the smartd daemon running on:
host name: fsn-node-08
DNS domain: torproject.org
The following warning/error was logged by the smartd daemon:
Device: /dev/sdb [SAT], 8 Currently unreadable (pending) sectors
Device info:
HGST HUS726060ALE610, S/N:NCGW08PV, WWN:5-000cca-24dcc4712, FW:APGNTD05, 6.00 TB
For details see host's SYSLOG.
You can also use the smartctl utility for further investigation.
Another message will be sent in 24 hours if the problem persists.
At first only fsn-node-08 exhibited this behavior, but since August 1st 2023, dal-rescue-01 has been doing so as well.
On July 31st, @lavamind performed an extended self-test on the affected drive on fsn-node-08 and didn't find anything:
Date: Tue, 01 Aug 2023 11:13:10 -0400
From: Jérôme Charaoui <lavamind@torproject.org>
To: root <root@fsn-node-08.torproject.org>
Subject: Re: SMART error (CurrentPendingSector) detected on host: fsn-node-08
The result of the test is as follows:
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     59408         -
Although Current_Pending_Sector remains at 8, I think we can ignore the
issue for now, as the drive's extended selftest is not reporting any errors.
Still, these daily emails are annoying and cause alert fatigue. Even if we assume that the disks are fine (and we shouldn't: I suspect this could be a sign of performance degradation on those drives, for example), the recurring warnings might lead us to ignore real SMART errors that we should definitely look into.
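One way to cut the noise without hiding real changes would be to alert only when the pending sector count goes up. Below is a minimal Python sketch of that idea, not a tested deployment: `parse_pending`, `should_alert`, and the `last_seen` dict are hypothetical names, and the regular expression assumes the exact message format shown in the smartd email above. smartd.conf(5) also documents a `-C ID+` directive that reports pending sectors only when the count increases, which may well be the simpler route and is worth checking first.

```python
import re

# Matches smartd lines such as:
#   Device: /dev/sdb [SAT], 8 Currently unreadable (pending) sectors
# Assumption: the format is the one seen in the email quoted in this ticket.
PENDING_RE = re.compile(
    r"Device: (?P<device>\S+).*?, (?P<count>\d+) "
    r"Currently unreadable \(pending\) sectors"
)


def parse_pending(line):
    """Return (device, count) from a smartd pending-sector line, or None."""
    match = PENDING_RE.search(line)
    if match is None:
        return None
    return match.group("device"), int(match.group("count"))


def should_alert(line, last_seen):
    """Return True only if the device's count exceeds the last recorded one.

    last_seen is a hypothetical dict mapping device path to the most recent
    pending-sector count; it is updated in place on every parsed line.
    """
    parsed = parse_pending(line)
    if parsed is None:
        return False
    device, count = parsed
    previous = last_seen.get(device, 0)
    last_seen[device] = count
    return count > previous
```

Fed the warning line from fsn-node-08 twice with the same `last_seen` dict, `should_alert` returns True the first time (8 is above the implicit starting value of 0) and False the second time (the count is unchanged). A later message reporting 9 sectors would alert again. Because the stored value follows the latest report, a count that drops and later rises again also triggers an alert.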