disk failure on dal-rescue-01

First diagnostic

It looks like a disk failed on dal-rescue-01: mdadm reported sda3 failing in md1, then sda2 failing in md0.

Follow the RAID failed-disk playbook, and consider cross-shipping dal-rescue-02.

There was also a warning about HTTPS not being reachable.

Alerts and mails
Date: Sun, 04 Jan 2026 19:14:21 +0000
From: alertmanager@prometheus-03.torproject.org
To: torproject-admin@torproject.org
Reply-To: tpa-team@lists.torproject.org
Subject: RAIDDegraded RAID array on dal-rescue-01.torproject.org is degraded

Total firing alerts: 1

## Firing Alerts

-----
Time: 2026-01-04 19:13:51.784 +0000 UTC
Summary: RAID array on dal-rescue-01.torproject.org is degraded
Description: The md1 RAID array on dal-rescue-01.torproject.org has failed: 1 disks failed in device md1
playbook: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/raid#failed-disk
-----
Date: Sun, 04 Jan 2026 19:12:57 +0000
From: mdadm monitoring <root@dal-rescue-01>
To: root@dal-rescue-01.torproject.org
Subject: Fail event on /dev/md/1:dal-rescue-01

This is an automatically generated mail message.
Fail event detected on md device /dev/md/1, component device /dev/sda3
The /proc/mdstat file currently contains the following:

Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sda2[3] mmcblk0p2[2]
      306176 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda3[3](F) mmcblk0p3[2]
      3564544 blocks super 1.2 [2/1] [U_]

unused devices: <none>

Date: Sun, 04 Jan 2026 20:15:24 +0000
From: mdadm monitoring <root@dal-rescue-01>
To: root@dal-rescue-01.torproject.org
Subject: Fail event on /dev/md/0:dal-rescue-01

This is an automatically generated mail message.
Fail event detected on md device /dev/md/0, component device /dev/sda2
The /proc/mdstat file currently contains the following:

Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 mmcblk0p2[2]
      306176 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sda3[3](F) mmcblk0p3[2]
      3564544 blocks super 1.2 [2/1] [U_]

unused devices: <none>

Current status

Roles

  • Lead: unless otherwise noted, the issue assignee
  • Operations:
  • Communications:
  • Planning:

Next steps

  • Inspect the state of the ejected disk, then try re-adding it to the array to see whether it is accepted.
  • If the problem persists, ship dal-rescue-02 to the Dallas DC. Once dal-rescue-02 is at the datacenter, get dal-rescue-01 shipped back so we can fix it locally.
    • In this case, the network configuration of dal-rescue-02 needs to be adjusted before it is shipped.
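
As a rough aid for the first step, the sketch below parses the /proc/mdstat text quoted in the second mdadm mail and prints a dry-run list of mdadm commands for re-adding a partition. It is only an illustration under assumptions: the helper names (parse_mdstat, degraded, readd_plan) are hypothetical, the arrays are assumed to be reachable as /dev/md0 and /dev/md1, and nothing is executed. The RAID playbook linked in the alert remains the reference procedure.

```python
import re

# Sample copied from the second mdadm mail in this issue.
MDSTAT = """\
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 mmcblk0p2[2]
      306176 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sda3[3](F) mmcblk0p3[2]
      3564544 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""


def parse_mdstat(text):
    """Parse active arrays from /proc/mdstat text (hypothetical helper).

    Only handles "mdN : active <level> <devices...>" entries, which is
    all the sample above contains.
    """
    arrays = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"^(md\d+) : active (\S+) (.*)$", line)
        if header:
            name, level, devices = header.groups()
            members, failed = [], []
            for dev in devices.split():
                dev_name = dev.split("[")[0]
                members.append(dev_name)
                if "(F)" in dev:  # (F) marks a member flagged as faulty
                    failed.append(dev_name)
            current = {"level": level, "members": members, "failed": failed,
                       "expected": None, "active": None}
            arrays[name] = current
            continue
        if current is not None:
            # Status line, e.g. "... [2/1] [U_]": expected / active members.
            counts = re.search(r"\[(\d+)/(\d+)\]", line)
            if counts:
                current["expected"] = int(counts.group(1))
                current["active"] = int(counts.group(2))
            current = None
    return arrays


def degraded(arrays):
    """Names of arrays running with fewer active members than expected."""
    return sorted(n for n, a in arrays.items() if a["active"] < a["expected"])


def readd_plan(arrays, array, partition):
    """Dry-run command list for re-adding a partition; nothing is run.

    Assumption: the array device node is /dev/<array>.
    """
    md, part = f"/dev/{array}", f"/dev/{partition}"
    plan = [["mdadm", "--detail", md]]
    if partition in arrays[array]["failed"]:
        # A member still listed as faulty has to be removed first.
        plan.append(["mdadm", md, "--remove", part])
    plan.append(["mdadm", md, "--re-add", part])
    return plan


arrays = parse_mdstat(MDSTAT)
for array, partition in (("md0", "sda2"), ("md1", "sda3")):
    for cmd in readd_plan(arrays, array, partition):
        print(" ".join(cmd))
```

If the re-add is refused or the disk drops out again, that points towards the cross-shipping option in the second step.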

Dashboards


Post-mortem

Detailed post-mortem to fill in later
  • Affected users:
  • Duration:
  • Status page link:
  • Report Status: not started

Timeline

Root cause analysis

What went well?

What could have gone better?

Recommendations and related issues

Edited Jan 05, 2026 by lelutin