NVMe RAID disk failure on dragon.tails.net

  • On June 17, 2025 at 12:20 PM, the Tails Team reported slowness on a specific Jenkins test in tails#21032

  • One hour later, while a short test was running on /dev/nvme1, the disk returned several I/O errors and was removed from the system.

  • Since then, we've been running with degraded RAID arrays on only one disk:

    root@dragon:~# cat /proc/mdstat 
    Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
    md1 : active raid1 nvme0n1p3[0]
          1952405504 blocks super 1.2 [2/1] [U_]
          bitmap: 12/15 pages [48KB], 65536KB chunk
    
    md0 : active raid1 nvme0n1p2[0]
          487424 blocks super 1.2 [2/1] [U_]
        
    unused devices: <none>
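
For monitoring, the degraded state can also be detected programmatically. In the output above, `[2/1]` means two members are expected and one is active, and `[U_]` shows which slot is missing. Below is a minimal Python sketch, not something that is deployed on dragon: the function name `degraded_arrays` and the embedded `SAMPLE` text are illustrative assumptions, and the parser only handles the simple layout shown here.

```python
import re

# Sample text copied from the /proc/mdstat output in this issue.
SAMPLE = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 nvme0n1p3[0]
      1952405504 blocks super 1.2 [2/1] [U_]
      bitmap: 12/15 pages [48KB], 65536KB chunk

md0 : active raid1 nvme0n1p2[0]
      487424 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array_name: (expected, active)} for arrays missing members."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            # Start of a new array stanza, e.g. "md1 : active raid1 ...".
            current = header.group(1)
            continue
        # The status line carries "[expected/active]", e.g. "[2/1]".
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if current and counts:
            expected, active = int(counts.group(1)), int(counts.group(2))
            if active < expected:
                result[current] = (expected, active)
            # Ignore the remaining lines of this stanza (bitmap, etc.).
            current = None
    return result


if __name__ == "__main__":
    for name, (expected, active) in degraded_arrays(SAMPLE).items():
        print(f"{name}: {active} of {expected} members active")
```

On the live machine the same function could be fed the contents of `/proc/mdstat` instead of `SAMPLE`.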

In order to fix this, we need to:

  • come up with hardware specs for the new disk
  • find out the costs for replacing
  • approve the budget
  • buy the new disk
  • install the new disk
  • restore the RAID array
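
As a rough illustration of the last step, here is a Python sketch that only builds and prints the `mdadm` commands for re-adding partitions to the two arrays. It runs nothing. It assumes the replacement disk shows up as `/dev/nvme1n1` and has already been partitioned to match `nvme0n1`. Both are assumptions, as is the helper name `plan_readd`.

```python
import re
import shlex


def plan_readd(arrays, new_disk):
    """Build (but do not run) one `mdadm --add` command per array.

    `arrays` maps an md device name to its surviving member partition,
    e.g. {"md0": "nvme0n1p2"}. The partition suffix ("p2") is reused on
    `new_disk`, which assumes an identical partition layout.
    """
    commands = []
    for md_name, member in sorted(arrays.items()):
        suffix = re.search(r"(p\d+)$", member)
        if suffix is None:
            raise ValueError(f"cannot find partition suffix in {member!r}")
        commands.append(
            ["mdadm", "--manage", f"/dev/{md_name}",
             "--add", f"{new_disk}{suffix.group(1)}"]
        )
    return commands


if __name__ == "__main__":
    # Surviving members as listed in /proc/mdstat in this issue.
    surviving = {"md0": "nvme0n1p2", "md1": "nvme0n1p3"}
    # "/dev/nvme1n1" is a hypothetical name for the replacement disk.
    for cmd in plan_readd(surviving, "/dev/nvme1n1"):
        print(shlex.join(cmd))
```

Printing the commands instead of executing them leaves room for review before anything touches the arrays. Progress of the resync would then be visible in `/proc/mdstat`.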

EDIT: we ran into more hardware issues along the way, and I now think the problem is with the machine itself rather than with the disks. Here's an updated todo list:

  • restore the Jenkins orchestrator somewhere reliable
  • restore the gitlab-runner somewhere reliable
  • recover the build history from one of the old disks
  • investigate and decide what to do about dragon
  • (re-)install new Jenkins agents (where?)
Edited Aug 05, 2025 by zen