Newer
Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
[[!toc levels=3]]
Software RAID
=============
Replacing a drive
-----------------
If a drive fails in a server, the procedure is essentially to open a
ticket, wait for the drive change, partition and re-add it to the RAID
array. The following procdure assumes that `sda` failed and `sdb` is
good in a RAID-1 array, but can vary with other RAID configurations or
drive models.
1. file a ticket upstream
[Hetzner Support](https://robot.your-server.de/support/), for example, has an excellent service which
asks you the disk serial number (available in the SMART email
notification) and the SMART log (output of `smartctl -x
/dev/sda`). Then they will turn off the machine, replace the disk,
and start it up again.
2. wait for the server to return with the new disk
Hetzner will send an email to the tpa alias when that is done.
3. partition the new drive (`sda`) to match the old (`sdb`):
sfdisk -d /dev/sdb | sfdisk --no-reread /dev/sda --force
4. re-add the new disk to the RAID array:
mdadm /dev/md0 -a /dev/sda
Note that Hetzner also has [pretty good documentation on how to deal
with SMART output](https://wiki.hetzner.de/index.php/Seriennummern_von_Festplatten_und_Hinweise_zu_defekten_Festplatten/en).
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
Hardware RAID
=============
Some TPO machines have hardware RAID with `megaraid`
controllers. Those are controlled with the `MegaCLI` command that is
... rather hard to use.
First, alias the megacli command because the package (derived from the
upstream RPM by Alien) installs it in a strange location:
alias megacli=/opt/MegaRAID/MegaCli/MegaCli
This will confirm you are using hardware raid:
root@moly:/home/anarcat# lspci | grep -i raid
05:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)
This will show the RAID levels of each enclosure, for example this is
RAID-10:
root@moly:/home/anarcat# megacli -LdPdInfo -aALL | grep "RAID Level"
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
This lists a summary of all the disks, for example the first disk has
failed here:
root@moly:/home/anarcat# megacli -PDList -aALL | grep -e '^Enclosure' -e '^Slot' -e '^PD' -e '^Firmware' -e '^Raw' -e '^Inquiry'
Enclosure Device ID: 252
Slot Number: 0
Enclosure position: 0
PD Type: SAS
Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Firmware state: Failed
Inquiry Data: SEAGATE ST3600057SS [REDACTED]
Enclosure Device ID: 252
Slot Number: 1
Enclosure position: 0
PD Type: SAS
Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Firmware state: Online, Spun Up
Inquiry Data: SEAGATE ST3600057SS [REDACTED]
Enclosure Device ID: 252
Slot Number: 2
Enclosure position: 0
PD Type: SAS
Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Firmware state: Online, Spun Up
Inquiry Data: SEAGATE ST3600057SS [REDACTED]
Enclosure Device ID: 252
Slot Number: 3
Enclosure position: 0
PD Type: SAS
Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Firmware state: Online, Spun Up
Inquiry Data: SEAGATE ST3600057SS [REDACTED]
This will make the drive blink (slot number 0 in enclosure 252):
megacli -PdLocate -start -physdrv[252:0] -aALL
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
SMART monitoring
----------------
Some servers will fail to properly detect disk drives in their SMART
configuration. In particular, `smartd` does not support:
* virtual disks (e.g. `/dev/nbd0`)
* MMC block devices (e.g. `/dev/mmcblk0`, commonly found on ARM
devices)
* out of the box, CCISS raid devices (e.g. `/dev/cciss/c0d0`)
The latter can be configured with the following snippet in
`/etc/smartd.conf`:
#DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
DEFAULT -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
/dev/cciss/c0d0 -d cciss,0
/dev/cciss/c0d0 -d cciss,1
/dev/cciss/c0d0 -d cciss,2
/dev/cciss/c0d0 -d cciss,3
/dev/cciss/c0d0 -d cciss,4
/dev/cciss/c0d0 -d cciss,5
Notice how the `DEVICESCAN` is commented out to be replaced by the
CCISS configuration. One line for each drive should be added (and no,
it does not autodetect all drives unfortunately). This hack was
deployed on `listera` which uses that hardware RAID.
Other hardware RAID controllers are better supported. For example, the
`megaraid` controller on `moly` was correctly detected by `smartd`
which accurately found a broken hard drive.
References
----------
Here are some external documentation links:
* <https://cs.uwaterloo.ca/twiki/view/CF/MegaRaid>
* <https://raid.wiki.kernel.org/index.php/Hardware_Raid_Setup_using_MegaCli>
* <https://sysadmin.compxtreme.ro/how-to-replace-an-lsi-raid-disk-with-megacli/>