reorder RAID docs, add external docs (authored by anarcat)
This should be slightly more readable
@@ -6,7 +6,7 @@
 If a drive fails in a server, the procedure is essentially to open a
 ticket, wait for the drive change, partition and re-add it to the RAID
-array. The following procdure assumes that `sda` failed and `sdb` is
+array. The following procedure assumes that `sda` failed and `sdb` is
 good in a RAID-1 array, but can vary with other RAID configurations or
 drive models.
@@ -35,6 +35,12 @@ with SMART output](https://wiki.hetzner.de/index.php/Seriennummern_von_Festplatt
 # Hardware RAID
+Note: we do not have hardware RAID servers, nor do we want any in the
+future.
+This documentation is kept only for historical reference, in case we
+end up with hardware RAID arrays again.
 ## MegaCLI operation
 Some TPO machines --particularly [at cymru](howto/new-machine-cymru) -- have hardware RAID with `megaraid`
@@ -197,7 +203,47 @@ currently not in use:
 a0e32s0 465GiB a0d0 online errs: media:0 other:819
 a0e32s1 465GiB a0d0 online errs: media:0 other:819
-## Pager playbook
+## References
+Here are some external documentation links regarding hardware RAID setups:
+* <https://cs.uwaterloo.ca/twiki/view/CF/MegaRaid>
+* <https://raid.wiki.kernel.org/index.php/Hardware_Raid_Setup_using_MegaCli>
+* <https://sysadmin.compxtreme.ro/how-to-replace-an-lsi-raid-disk-with-megacli/>
+* <https://wikitech.wikimedia.org/wiki/MegaCli>
+# SMART monitoring
+Some servers will fail to properly detect disk drives in their SMART
+configuration. In particular, `smartd` does not support:
+* virtual disks (e.g. `/dev/nbd0`)
+* MMC block devices (e.g. `/dev/mmcblk0`, commonly found on ARM
+devices)
+* out of the box, CCISS raid devices (e.g. `/dev/cciss/c0d0`)
+The latter can be configured with the following snippet in
+`/etc/smartd.conf`:
+#DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
+DEFAULT -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
+/dev/cciss/c0d0 -d cciss,0
+/dev/cciss/c0d0 -d cciss,1
+/dev/cciss/c0d0 -d cciss,2
+/dev/cciss/c0d0 -d cciss,3
+/dev/cciss/c0d0 -d cciss,4
+/dev/cciss/c0d0 -d cciss,5
+Notice how the `DEVICESCAN` is commented out to be replaced by the
+CCISS configuration. One line for each drive should be added (and no,
+it does not autodetect all drives unfortunately). This hack was
+deployed on `listera` which uses that hardware RAID.
+Other hardware RAID controllers are better supported. For example, the
+`megaraid` controller on `moly` was correctly detected by `smartd`
+which accurately found a broken hard drive.
+# Pager playbook
 Prometheus should be monitoring hardware RAID on servers that support
 it. This is normally auto-detected by the Prometheus node exporter.
@@ -205,7 +251,7 @@ it. This is normally auto-detected by the Prometheus node exporter.
 NOTE: those instructions are out of date and need to be rewritten for
 Prometheus, see [tpo/tpa/prometheus-alerts#16](https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/issues/16).
-### Failed disk
+## Failed disk
 A normal RAID-1 Nagios check output looks like this:
@@ -219,7 +265,7 @@ It actually has the numbers backwards: in the above situation, there
 was only *one* degraded drive, and 3 healthy ones. See above for how
 to restore a drive in a MegaRAID array.
-### Disks with "other" errors
+## Disks with "other" errors
 The following warning may seem innocuous but actually reports that
 drives have "errors:
@@ -259,42 +305,12 @@ safely ignored.
 [this discussion]: https://serverfault.com/questions/482705/megacli-causes-drive-other-error
 [Key Code Qualifier]: https://en.wikipedia.org/wiki/Key_Code_Qualifier
-# SMART monitoring
-Some servers will fail to properly detect disk drives in their SMART
-configuration. In particular, `smartd` does not support:
-* virtual disks (e.g. `/dev/nbd0`)
-* MMC block devices (e.g. `/dev/mmcblk0`, commonly found on ARM
-devices)
-* out of the box, CCISS raid devices (e.g. `/dev/cciss/c0d0`)
-The latter can be configured with the following snippet in
-`/etc/smartd.conf`:
-#DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
-DEFAULT -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
-/dev/cciss/c0d0 -d cciss,0
-/dev/cciss/c0d0 -d cciss,1
-/dev/cciss/c0d0 -d cciss,2
-/dev/cciss/c0d0 -d cciss,3
-/dev/cciss/c0d0 -d cciss,4
-/dev/cciss/c0d0 -d cciss,5
-Notice how the `DEVICESCAN` is commented out to be replaced by the
-CCISS configuration. One line for each drive should be added (and no,
-it does not autodetect all drives unfortunately). This hack was
-deployed on `listera` which uses that hardware RAID.
+# Other documentation
-Other hardware RAID controllers are better supported. For example, the
-`megaraid` controller on `moly` was correctly detected by `smartd`
-which accurately found a broken hard drive.
-## References
-Here are some external documentation links:
+See also:
-* <https://cs.uwaterloo.ca/twiki/view/CF/MegaRaid>
-* <https://raid.wiki.kernel.org/index.php/Hardware_Raid_Setup_using_MegaCli>
-* <https://sysadmin.compxtreme.ro/how-to-replace-an-lsi-raid-disk-with-megacli/>
-* <https://wikitech.wikimedia.org/wiki/MegaCli>
+- [LVM](howto/lvm)
+- [RAID wiki](https://archive.kernel.org/oldwiki/raid.wiki.kernel.org/) (archived)
+- [md(4) manual page](https://manpages.debian.org/bookworm/mdadm/md.4.en.html)
+- [mdadm(8) manual page](https://manpages.debian.org/bookworm/mdadm/mdadm.8.en.html)
+- [md driver kernel documentation](https://docs.kernel.org/admin-guide/md.html)
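
Aside on the `smartd.conf` snippet that this change moves within the SMART monitoring section: because `smartd` does not autodetect drives behind a CCISS controller, the per-drive lines have to be written out one by one. The following is a minimal Python sketch of generating them, not part of the commit. The function name `cciss_smartd_lines` and its parameters are hypothetical, and the default of six drives is an assumption chosen only to reproduce the six lines shown for `listera`.

```python
def cciss_smartd_lines(device="/dev/cciss/c0d0", drives=6):
    """Return one smartd.conf line per physical drive behind a CCISS controller.

    `device` is the controller's block device and `drives` is the number of
    physical disks; both defaults are assumptions taken from the snippet in
    the documentation, not values detected from hardware.
    """
    if drives < 0:
        raise ValueError("drives must be non-negative")
    # smartd addresses each physical disk as "-d cciss,N", with N starting at 0.
    return [f"{device} -d cciss,{n}" for n in range(drives)]


if __name__ == "__main__":
    # Print the per-drive lines, ready to paste below the DEFAULT directive.
    print("\n".join(cciss_smartd_lines()))
```

The `DEVICESCAN` directive would still need to be commented out by hand, as the documentation notes, and the drive count must match the actual hardware.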