NVMe RAID disk failure on dragon.tails.net

On June 17, 2025 at 12:20 PM, the Tails Team reported slowness on a specific Jenkins test in tails#21032.
One hour later, while a short test was running on /dev/nvme1, the disk returned several I/O errors and was removed from the system.

Since then, we have been running on degraded RAID arrays with only one disk each:

root@dragon:~# cat /proc/mdstat 
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md1 : active raid1 nvme0n1p3[0]
      1952405504 blocks super 1.2 [2/1] [U_]
      bitmap: 12/15 pages [48KB], 65536KB chunk

md0 : active raid1 nvme0n1p2[0]
      487424 blocks super 1.2 [2/1] [U_]
    
unused devices: <none>
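
In the output above, `[2/1] [U_]` means each array expects two members but only one is active. As a minimal sketch (not something currently deployed on dragon), the same check could be scripted as below. The function name `degraded_arrays` is hypothetical, and the parser assumes array names of the form `mdN` and the status layout shown above.

```python
import re

# Sample copied from the /proc/mdstat output above.
SAMPLE = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 nvme0n1p3[0]
      1952405504 blocks super 1.2 [2/1] [U_]
      bitmap: 12/15 pages [48KB], 65536KB chunk

md0 : active raid1 nvme0n1p2[0]
      487424 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array: (expected, active)} for arrays with missing members.

    Hypothetical helper: assumes arrays are named mdN and that the status
    line carries "[expected/active] [U_...]" as in the sample above.
    """
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        # A line such as "md1 : active raid1 ..." starts a new array entry.
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status line holds "[2/1] [U_]": expected vs. active members.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[[U_]+\]", line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                result[current] = (expected, active)
    return result


if __name__ == "__main__":
    for name, (expected, active) in degraded_arrays(SAMPLE).items():
        print(f"{name}: {active} of {expected} members active")
```

On a live system, the contents of `/proc/mdstat` would be passed in instead of `SAMPLE`.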

In order to fix this, we need to:

EDIT: we ran into more hardware issues along the way, and I now think that something is wrong with the machine itself rather than with the disks. So here's an updated todo list:

restore the Jenkins orchestrator somewhere reliable
restore the gitlab-runner somewhere reliable
recover the build history from one of the old disks
investigate and decide what to do about dragon
(re-)install new Jenkins agents (where?)

Edited Aug 05, 2025 by zen