Verified commit 1daba1fc, authored by anarcat

reorder RAID docs, add external docs

This should be slightly more readable
parent d8553896
+57 −41
@@ -6,7 +6,7 @@

If a drive fails in a server, the procedure is essentially to open a
ticket, wait for the drive change, partition and re-add it to the RAID
array. The following procdure assumes that `sda` failed and `sdb` is
array. The following procedure assumes that `sda` failed and `sdb` is
good in a RAID-1 array, but can vary with other RAID configurations or
drive models.

@@ -35,6 +35,12 @@ with SMART output](https://wiki.hetzner.de/index.php/Seriennummern_von_Festplatt

# Hardware RAID

Note: we do not have hardware RAID servers, nor do we want any in the
future.

This documentation is kept only for historical reference, in case we
end up with hardware RAID arrays again.

## MegaCLI operation

Some TPO machines, particularly [at cymru](howto/new-machine-cymru), have hardware RAID with `megaraid`
@@ -197,7 +203,47 @@ currently not in use:
    a0e32s0     465GiB  a0d0  online   errs: media:0  other:819
    a0e32s1     465GiB  a0d0  online   errs: media:0  other:819
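
Status lines like the two above can be checked for non-zero error counters with a short script. This is only a minimal sketch: it assumes the column layout seen in those two lines, and the field names (`disk`, `size`, `unit`, `state`) and the `parse_disk_line` helper are hypothetical labels chosen for illustration, not names used by any tool.

```python
import re

# Matches lines shaped like:
#   a0e32s0     465GiB  a0d0  online   errs: media:0  other:819
# The group names are illustrative guesses at what each column means.
DISK_LINE = re.compile(
    r"^\s*(?P<disk>\S+)\s+(?P<size>\S+)\s+(?P<unit>\S+)\s+(?P<state>\S+)"
    r"\s+errs:\s+media:(?P<media>\d+)\s+other:(?P<other>\d+)\s*$"
)


def parse_disk_line(line):
    """Return the fields of one status line as a dict, or None if it does not match."""
    match = DISK_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the two error counters to integers so they can be compared.
    fields["media"] = int(fields["media"])
    fields["other"] = int(fields["other"])
    return fields


if __name__ == "__main__":
    sample = "    a0e32s0     465GiB  a0d0  online   errs: media:0  other:819"
    info = parse_disk_line(sample)
    if info and (info["media"] or info["other"]):
        print(f"{info['disk']}: media={info['media']} other={info['other']}")
```

Returning `None` for unrecognized lines lets the caller skip headers or blank lines in the surrounding output without special handling.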

## Pager playbook
## References

Here are some external documentation links regarding hardware RAID setups:

 * <https://cs.uwaterloo.ca/twiki/view/CF/MegaRaid>
 * <https://raid.wiki.kernel.org/index.php/Hardware_Raid_Setup_using_MegaCli>
 * <https://sysadmin.compxtreme.ro/how-to-replace-an-lsi-raid-disk-with-megacli/>
 * <https://wikitech.wikimedia.org/wiki/MegaCli>

# SMART monitoring

Some servers will fail to properly detect disk drives in their SMART
configuration. In particular, `smartd` does not support:

 * virtual disks (e.g. `/dev/nbd0`)
 * MMC block devices (e.g. `/dev/mmcblk0`, commonly found on ARM
   devices)
 * out of the box, CCISS raid devices (e.g. `/dev/cciss/c0d0`)

The latter can be configured with the following snippet in
`/etc/smartd.conf`:

    #DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
    DEFAULT -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
    /dev/cciss/c0d0 -d cciss,0
    /dev/cciss/c0d0 -d cciss,1
    /dev/cciss/c0d0 -d cciss,2
    /dev/cciss/c0d0 -d cciss,3
    /dev/cciss/c0d0 -d cciss,4
    /dev/cciss/c0d0 -d cciss,5

Notice how the `DEVICESCAN` is commented out to be replaced by the
CCISS configuration. One line for each drive should be added (and no,
it does not autodetect all drives unfortunately). This hack was
deployed on `listera` which uses that hardware RAID.
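
Since each drive needs its own line, the per-drive entries can be generated rather than typed by hand. The following is a minimal sketch, assuming the drive count is known in advance (six in the snippet above); the `cciss_smartd_lines` helper is hypothetical and only reproduces the `-d cciss,N` pattern shown in that snippet.

```python
def cciss_smartd_lines(device, drive_count):
    """Return one smartd.conf entry per physical drive behind a CCISS device.

    `drive_count` must be supplied by the caller: as noted above, the
    drives are not autodetected.
    """
    if drive_count < 1:
        raise ValueError("drive_count must be at least 1")
    # Drive indexes start at 0, matching the snippet above.
    return [f"{device} -d cciss,{index}" for index in range(drive_count)]


if __name__ == "__main__":
    # Prints the six per-drive lines from the snippet above.
    for line in cciss_smartd_lines("/dev/cciss/c0d0", 6):
        print(line)
```

The output only covers the per-drive lines; the commented-out `DEVICESCAN` and the `DEFAULT` directive still have to be set as shown in the snippet.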

Other hardware RAID controllers are better supported. For example, the
`megaraid` controller on `moly` was correctly detected by `smartd`
which accurately found a broken hard drive.

# Pager playbook

Prometheus should be monitoring hardware RAID on servers that support
it. This is normally auto-detected by the Prometheus node exporter.
@@ -205,7 +251,7 @@ it. This is normally auto-detected by the Prometheus node exporter.
NOTE: those instructions are out of date and need to be rewritten for
Prometheus, see [tpo/tpa/prometheus-alerts#16](https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/issues/16).

### Failed disk
## Failed disk

A normal RAID-1 Nagios check output looks like this:

@@ -219,7 +265,7 @@ It actually has the numbers backwards: in the above situation, there
was only *one* degraded drive, and 3 healthy ones. See above for how
to restore a drive in a MegaRAID array.

### Disks with "other" errors
## Disks with "other" errors

The following warning may seem innocuous but actually reports that
drives have "errors:
@@ -259,42 +305,12 @@ safely ignored.
[this discussion]: https://serverfault.com/questions/482705/megacli-causes-drive-other-error
[Key Code Qualifier]: https://en.wikipedia.org/wiki/Key_Code_Qualifier

# SMART monitoring

Some servers will fail to properly detect disk drives in their SMART
configuration. In particular, `smartd` does not support:

 * virtual disks (e.g. `/dev/nbd0`)
 * MMC block devices (e.g. `/dev/mmcblk0`, commonly found on ARM
   devices)
 * out of the box, CCISS raid devices (e.g. `/dev/cciss/c0d0`)

The latter can be configured with the following snippet in
`/etc/smartd.conf`:

    #DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
    DEFAULT -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
    /dev/cciss/c0d0 -d cciss,0
    /dev/cciss/c0d0 -d cciss,1
    /dev/cciss/c0d0 -d cciss,2
    /dev/cciss/c0d0 -d cciss,3
    /dev/cciss/c0d0 -d cciss,4
    /dev/cciss/c0d0 -d cciss,5

Notice how the `DEVICESCAN` is commented out to be replaced by the
CCISS configuration. One line for each drive should be added (and no,
it does not autodetect all drives unfortunately). This hack was
deployed on `listera` which uses that hardware RAID.
# Other documentation

Other hardware RAID controllers are better supported. For example, the
`megaraid` controller on `moly` was correctly detected by `smartd`
which accurately found a broken hard drive.

## References

Here are some external documentation links:
See also:

 * <https://cs.uwaterloo.ca/twiki/view/CF/MegaRaid>
 * <https://raid.wiki.kernel.org/index.php/Hardware_Raid_Setup_using_MegaCli>
 * <https://sysadmin.compxtreme.ro/how-to-replace-an-lsi-raid-disk-with-megacli/>
 * <https://wikitech.wikimedia.org/wiki/MegaCli>
- [LVM](howto/lvm)
- [RAID wiki](https://archive.kernel.org/oldwiki/raid.wiki.kernel.org/) (archived)
- [md(4) manual page](https://manpages.debian.org/bookworm/mdadm/md.4.en.html)
- [mdadm(8) manual page](https://manpages.debian.org/bookworm/mdadm/mdadm.8.en.html)
- [md driver kernel documentation](https://docs.kernel.org/admin-guide/md.html)