reorder RAID docs, add external docs
authored Feb 07, 2025 by anarcat
This should be slightly more readable
howto/raid.md
View page @
1daba1fc
...
...
@@ -6,7 +6,7 @@
If a drive fails in a server, the procedure is essentially to open a
ticket, wait for the drive change, partition and re-add it to the RAID
array. The following procdure assumes that `sda` failed and `sdb` is
array. The following procedure assumes that `sda` failed and `sdb` is
good in a RAID-1 array, but can vary with other RAID configurations or
drive models.
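
As a rough illustration of the "partition and re-add" step, here is a minimal dry-run sketch that only prints the commands an operator might run by hand. It assumes the partition table is copied from the healthy drive with `sfdisk` and that each array gets its partition back with `mdadm --add`. The helper name `replacement_commands`, the `/dev/md0` array name and the array-to-partition mapping are hypothetical and must be adapted to the actual server.

```python
import shlex


def replacement_commands(good="/dev/sdb", new="/dev/sda", arrays=None):
    """Return, in order, the shell commands to review and run by hand.

    `arrays` maps an md array to the partition number it uses on each
    drive. The default (/dev/md0 on partition 1) is a hypothetical layout.
    """
    arrays = arrays or {"/dev/md0": 1}
    cmds = [
        # Copy the partition table from the healthy drive to the new one.
        f"sfdisk -d {shlex.quote(good)} | sfdisk {shlex.quote(new)}",
    ]
    for array, partno in sorted(arrays.items()):
        # Assumes sdX-style partition naming (sda1, sda2, ...), not NVMe.
        member = shlex.quote(f"{new}{partno}")
        cmds.append(f"mdadm {shlex.quote(array)} --add {member}")
    # Watching /proc/mdstat shows the progress of the rebuild.
    cmds.append("cat /proc/mdstat")
    return cmds


if __name__ == "__main__":
    for cmd in replacement_commands(arrays={"/dev/md0": 1, "/dev/md1": 2}):
        print(cmd)
```

Nothing is executed by this sketch: the output is meant to be checked against the real drive and array names before anything is typed on the server.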
...
...
@@ -35,6 +35,12 @@ with SMART output](https://wiki.hetzner.de/index.php/Seriennummern_von_Festplatt
# Hardware RAID
Note: we do not have hardware RAID servers, nor do we want any in the
future.
This documentation is kept only for historical reference, in case we
end up with hardware RAID arrays again.
## MegaCLI operation
Some TPO machines --particularly [at cymru](howto/new-machine-cymru)
-- have hardware RAID with `megaraid`
...
...
@@ -197,7 +203,47 @@ currently not in use:
a0e32s0 465GiB a0d0 online errs: media:0 other:819
a0e32s1 465GiB a0d0 online errs: media:0 other:819
## Pager playbook
## References
Here are some external documentation links regarding hardware RAID setups:
 * <https://cs.uwaterloo.ca/twiki/view/CF/MegaRaid>
 * <https://raid.wiki.kernel.org/index.php/Hardware_Raid_Setup_using_MegaCli>
 * <https://sysadmin.compxtreme.ro/how-to-replace-an-lsi-raid-disk-with-megacli/>
 * <https://wikitech.wikimedia.org/wiki/MegaCli>
# SMART monitoring
Some servers will fail to properly detect disk drives in their SMART
configuration. In particular, `smartd` does not support:
 * virtual disks (e.g. `/dev/nbd0`)
 * MMC block devices (e.g. `/dev/mmcblk0`, commonly found on ARM
   devices)
 * out of the box, CCISS raid devices (e.g. `/dev/cciss/c0d0`)
The latter can be configured with the following snippet in
`/etc/smartd.conf`:

    #DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
    DEFAULT -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
    /dev/cciss/c0d0 -d cciss,0
    /dev/cciss/c0d0 -d cciss,1
    /dev/cciss/c0d0 -d cciss,2
    /dev/cciss/c0d0 -d cciss,3
    /dev/cciss/c0d0 -d cciss,4
    /dev/cciss/c0d0 -d cciss,5

Notice how the `DEVICESCAN` is commented out to be replaced by the
CCISS configuration. One line for each drive should be added (and no,
it does not autodetect all drives unfortunately). This hack was
deployed on `listera` which uses that hardware RAID.
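
Since one line per drive has to be written by hand, the per-drive entries could be generated instead. The following is a minimal sketch, not something deployed on `listera`: the helper name `cciss_smartd_lines` is hypothetical, and the drive count is an assumption that has to match the number of physical drives behind the controller.

```python
def cciss_smartd_lines(device="/dev/cciss/c0d0", drives=6):
    """Return one smartd.conf entry per physical drive behind `device`.

    `drives` is the number of physical drives. The default of 6 only
    mirrors the snippet above (cciss,0 through cciss,5).
    """
    return [f"{device} -d cciss,{n}" for n in range(drives)]


if __name__ == "__main__":
    # Paste the output below the DEFAULT line in /etc/smartd.conf.
    print("\n".join(cciss_smartd_lines()))
```

The `DEFAULT` line and the commented-out `DEVICESCAN` line still need to be kept as in the snippet above, as this only produces the device entries.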
Other hardware RAID controllers are better supported. For example, the
`megaraid` controller on `moly` was correctly detected by `smartd`
which accurately found a broken hard drive.
# Pager playbook
Prometheus should be monitoring hardware RAID on servers that support
it. This is normally auto-detected by the Prometheus node exporter.
...
...
@@ -205,7 +251,7 @@ it. This is normally auto-detected by the Prometheus node exporter.
NOTE: those instructions are out of date and need to be rewritten for
Prometheus, see [tpo/tpa/prometheus-alerts#16](https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/issues/16).
### Failed disk
## Failed disk
A normal RAID-1 Nagios check output looks like this:
...
...
@@ -219,7 +265,7 @@ It actually has the numbers backwards: in the above situation, there
was only *one* degraded drive, and 3 healthy ones. See above for how
to restore a drive in a MegaRAID array.
### Disks with "other" errors
## Disks with "other" errors
The following warning may seem innocuous but actually reports that
drives have "errors:
...
...
@@ -259,42 +305,12 @@ safely ignored.
[this discussion]: https://serverfault.com/questions/482705/megacli-causes-drive-other-error
[Key Code Qualifier]: https://en.wikipedia.org/wiki/Key_Code_Qualifier
# SMART monitoring
Some servers will fail to properly detect disk drives in their SMART
configuration. In particular, `smartd` does not support:
 * virtual disks (e.g. `/dev/nbd0`)
 * MMC block devices (e.g. `/dev/mmcblk0`, commonly found on ARM
   devices)
 * out of the box, CCISS raid devices (e.g. `/dev/cciss/c0d0`)
The latter can be configured with the following snippet in
`/etc/smartd.conf`:

    #DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
    DEFAULT -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
    /dev/cciss/c0d0 -d cciss,0
    /dev/cciss/c0d0 -d cciss,1
    /dev/cciss/c0d0 -d cciss,2
    /dev/cciss/c0d0 -d cciss,3
    /dev/cciss/c0d0 -d cciss,4
    /dev/cciss/c0d0 -d cciss,5

Notice how the `DEVICESCAN` is commented out to be replaced by the
CCISS configuration. One line for each drive should be added (and no,
it does not autodetect all drives unfortunately). This hack was
deployed on `listera` which uses that hardware RAID.
# Other documentation
Other hardware RAID controllers are better supported. For example, the
`megaraid` controller on `moly` was correctly detected by `smartd`
which accurately found a broken hard drive.
## References
Here are some external documentation links:
See also:
 * <https://cs.uwaterloo.ca/twiki/view/CF/MegaRaid>
 * <https://raid.wiki.kernel.org/index.php/Hardware_Raid_Setup_using_MegaCli>
 * <https://sysadmin.compxtreme.ro/how-to-replace-an-lsi-raid-disk-with-megacli/>
 * <https://wikitech.wikimedia.org/wiki/MegaCli>
- [LVM](howto/lvm)
- [RAID wiki](https://archive.kernel.org/oldwiki/raid.wiki.kernel.org/) (archived)
- [md(4) manual page](https://manpages.debian.org/bookworm/mdadm/md.4.en.html)
- [mdadm(8) manual page](https://manpages.debian.org/bookworm/mdadm/mdadm.8.en.html)
- [md driver kernel documentation](https://docs.kernel.org/admin-guide/md.html)