howto/raid.md +57 −41

@@ -6,7 +6,7 @@
 
 If a drive fails in a server, the procedure is essentially to open a
 ticket, wait for the drive change, partition and re-add it to the RAID
-array. The following procdure assumes that `sda` failed and `sdb` is
+array. The following procedure assumes that `sda` failed and `sdb` is
 good in a RAID-1 array, but can vary with other RAID configurations or
 drive models.
 
@@ -35,6 +35,12 @@ with SMART output](https://wiki.hetzner.de/index.php/Seriennummern_von_Festplatt
 
 # Hardware RAID
 
+Note: we do not have hardware RAID servers, nor do we want any in the
+future. This documentation is kept only for historical reference, in
+case we end up with hardware RAID arrays again.
+
+## MegaCLI operation
+
 Some TPO machines --particularly [at
 cymru](howto/new-machine-cymru) -- have hardware RAID with `megaraid`
@@ -197,7 +203,47 @@ currently not in use:
     a0e32s0 465GiB a0d0 online errs: media:0 other:819
     a0e32s1 465GiB a0d0 online errs: media:0 other:819
 
-## Pager playbook
+## References
+
+Here are some external documentation links regarding hardware RAID setups:
+
+* <https://cs.uwaterloo.ca/twiki/view/CF/MegaRaid>
+* <https://raid.wiki.kernel.org/index.php/Hardware_Raid_Setup_using_MegaCli>
+* <https://sysadmin.compxtreme.ro/how-to-replace-an-lsi-raid-disk-with-megacli/>
+* <https://wikitech.wikimedia.org/wiki/MegaCli>
+
+# SMART monitoring
+
+Some servers will fail to properly detect disk drives in their SMART
+configuration. In particular, `smartd` does not support:
+
+* virtual disks (e.g. `/dev/nbd0`)
+* MMC block devices (e.g. `/dev/mmcblk0`, commonly found on ARM devices)
+* out of the box, CCISS raid devices (e.g.
+  `/dev/cciss/c0d0`)
+
+The latter can be configured with the following snippet in
+`/etc/smartd.conf`:
+
+    #DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
+    DEFAULT -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
+    /dev/cciss/c0d0 -d cciss,0
+    /dev/cciss/c0d0 -d cciss,1
+    /dev/cciss/c0d0 -d cciss,2
+    /dev/cciss/c0d0 -d cciss,3
+    /dev/cciss/c0d0 -d cciss,4
+    /dev/cciss/c0d0 -d cciss,5
+
+Notice how the `DEVICESCAN` is commented out to be replaced by the
+CCISS configuration. One line for each drive should be added (and no,
+it does not autodetect all drives unfortunately). This hack was
+deployed on `listera` which uses that hardware RAID.
+
+Other hardware RAID controllers are better supported. For example, the
+`megaraid` controller on `moly` was correctly detected by `smartd`
+which accurately found a broken hard drive.
+
+# Pager playbook
 
 Prometheus should be monitoring hardware RAID on servers that support
 it. This is normally auto-detected by the Prometheus node exporter.
@@ -205,7 +251,7 @@ it. This is normally auto-detected by the Prometheus node exporter.
 NOTE: those instructions are out of date and need to be rewritten for
 Prometheus, see [tpo/tpa/prometheus-alerts#16](https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/issues/16).
 
-### Failed disk
+## Failed disk
 
 A normal RAID-1 Nagios check output looks like this:
 
@@ -219,7 +265,7 @@ It actually has the numbers backwards: in the above situation, there
 was only *one* degraded drive, and 3 healthy ones. See above for how to
 restore a drive in a MegaRAID array.
 
-### Disks with "other" errors
+## Disks with "other" errors
 
 The following warning may seem innocuous but actually reports that
 drives have "errors:
@@ -259,42 +305,12 @@ safely ignored.
 [this discussion]: https://serverfault.com/questions/482705/megacli-causes-drive-other-error
 [Key Code Qualifier]: https://en.wikipedia.org/wiki/Key_Code_Qualifier
 
-# SMART monitoring
-
-Some servers will fail to properly detect disk drives in their SMART
-configuration. In particular, `smartd` does not support:
-
-* virtual disks (e.g. `/dev/nbd0`)
-* MMC block devices (e.g. `/dev/mmcblk0`, commonly found on ARM devices)
-* out of the box, CCISS raid devices (e.g.
-  `/dev/cciss/c0d0`)
-
-The latter can be configured with the following snippet in
-`/etc/smartd.conf`:
-
-    #DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
-    DEFAULT -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
-    /dev/cciss/c0d0 -d cciss,0
-    /dev/cciss/c0d0 -d cciss,1
-    /dev/cciss/c0d0 -d cciss,2
-    /dev/cciss/c0d0 -d cciss,3
-    /dev/cciss/c0d0 -d cciss,4
-    /dev/cciss/c0d0 -d cciss,5
-
-Notice how the `DEVICESCAN` is commented out to be replaced by the
-CCISS configuration. One line for each drive should be added (and no,
-it does not autodetect all drives unfortunately). This hack was
-deployed on `listera` which uses that hardware RAID.
+# Other documentation
 
-Other hardware RAID controllers are better supported. For example, the
-`megaraid` controller on `moly` was correctly detected by `smartd`
-which accurately found a broken hard drive.
-
-## References
-
-Here are some external documentation links:
+See also:
 
-* <https://cs.uwaterloo.ca/twiki/view/CF/MegaRaid>
-* <https://raid.wiki.kernel.org/index.php/Hardware_Raid_Setup_using_MegaCli>
-* <https://sysadmin.compxtreme.ro/how-to-replace-an-lsi-raid-disk-with-megacli/>
-* <https://wikitech.wikimedia.org/wiki/MegaCli>
+- [LVM](howto/lvm)
+- [RAID wiki](https://archive.kernel.org/oldwiki/raid.wiki.kernel.org/) (archived)
+- [md(4) manual page](https://manpages.debian.org/bookworm/mdadm/md.4.en.html)
+- [mdadm(8) manual page](https://manpages.debian.org/bookworm/mdadm/mdadm.8.en.html)
+- [md driver kernel documentation](https://docs.kernel.org/admin-guide/md.html)
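The CCISS snippet in the SMART monitoring section lists one `-d cciss,N` line per drive by hand, since `smartd` does not autodetect them. As a minimal sketch only, those lines could be generated rather than typed. The helper name `cciss_smartd_lines` and its default of six drives are assumptions made for illustration, taken from the example snippet and not from any TPA tooling.

```python
def cciss_smartd_lines(device: str = "/dev/cciss/c0d0", drives: int = 6) -> list[str]:
    """Return one smartd.conf line per physical drive behind a CCISS controller.

    Hypothetical helper: `device` and `drives` default to the values seen in
    the example snippet and must be adjusted to the actual hardware.
    """
    if drives < 0:
        raise ValueError("drives must be non-negative")
    # smartd addresses each physical drive as "cciss,N", with N starting at 0.
    return [f"{device} -d cciss,{index}" for index in range(drives)]


if __name__ == "__main__":
    # Print the per-drive lines, ready to paste below the DEFAULT directive.
    print("\n".join(cciss_smartd_lines()))
```

The output only covers the per-drive lines. The commented-out `DEVICESCAN` and the `DEFAULT` directive from the snippet still have to be set by hand.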