NVMe RAID disk failure on dragon.tails.net
- On June 17 2025 at 12:20 PM, the Tails Team reported slowness on a specific Jenkins test (see tails#21032).
- One hour later, while running a short test on `/dev/nvme1`, the disk gave several I/O errors and was removed from the system.
- Since then, we've been running with a degraded RAID array with only one disk:

    root@dragon:~# cat /proc/mdstat
    Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
    md1 : active raid1 nvme0n1p3[0]
          1952405504 blocks super 1.2 [2/1] [U_]
          bitmap: 12/15 pages [48KB], 65536KB chunk

    md0 : active raid1 nvme0n1p2[0]
          487424 blocks super 1.2 [2/1] [U_]

    unused devices: <none>

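For reference, the `[2/1] [U_]` fields above are what mark both arrays as degraded: two members expected, one present, and the underscore standing for the missing one. A minimal sketch of how this could be checked automatically, assuming a POSIX shell with `awk`; the function name `degraded_arrays` is made up for illustration and is not part of our current monitoring:

```shell
#!/bin/sh
# Print the name of every md array whose member status field (e.g. [U_])
# contains an underscore, meaning at least one member is missing.
# Reads /proc/mdstat-formatted text from stdin.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }        # remember the array being described
        name != "" && /\[[U_]+\]/ {        # its status line, e.g. "[2/1] [U_]"
            match($0, /\[[U_]+\]/)
            if (index(substr($0, RSTART, RLENGTH), "_"))
                print name
            name = ""
        }
    '
}

# Usage on a live system:
#   degraded_arrays < /proc/mdstat
```

Fed the output above, this would list both `md1` and `md0`; a healthy array showing `[UU]` is not printed.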
In order to fix this, we need to:
- come up with hardware specs for the new disk
- find out the costs for replacing
- approve the budget
- buy the new disk
- install the new disk
- restore the RAID array
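The last step could look roughly like the sketch below. It only prints the commands for review and does not run anything. The surviving disk `/dev/nvme0n1` and the partition numbers come from the mdstat output above; the replacement disk name `/dev/nvme1n1` and the helper name `plan_rebuild` are assumptions, and the real device name would need to be checked (e.g. with `lsblk`) once the disk is installed:

```shell
#!/bin/sh
# Print (do not run) the commands for adding a replacement disk back into
# the two degraded arrays.
plan_rebuild() {
    good=$1   # surviving member disk, /dev/nvme0n1 according to /proc/mdstat
    new=$2    # replacement disk; the name is an ASSUMPTION until verified

    # Copy the partition layout from the surviving disk to the new one.
    echo "sfdisk -d $good | sfdisk $new"
    # md0 lives on partition 2 and md1 on partition 3 of the surviving disk,
    # so the same partition numbers are assumed on the new disk.
    echo "mdadm --manage /dev/md0 --add ${new}p2"
    echo "mdadm --manage /dev/md1 --add ${new}p3"
    # Watch the resync progress afterwards.
    echo "cat /proc/mdstat"
}

plan_rebuild /dev/nvme0n1 /dev/nvme1n1
```

Printing instead of executing is deliberate: pointing `sfdisk` or `mdadm --add` at the wrong device would overwrite it, so the plan should be read against the actual device names first.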
EDIT: we ran into more hardware issues along the way, and right now I think there's really something wrong with the machine itself rather than with the disks. So here's an updated todo list:
- restore the Jenkins orchestrator somewhere reliable
- restore the gitlab-runner somewhere reliable
- recover the build history from one of the old disks
- investigate and decide what to do about dragon
- (re-)install new Jenkins agents (where?)
Edited by zen