reorder RAID docs, add external docs (1daba1fc)

howto/raid.md

+57 −41

		@@ -6,7 +6,7 @@

		If a drive fails in a server, the procedure is essentially to open a
		ticket, wait for the drive change, partition and re-add it to the RAID
		array. The following procdure assumes that `sda` failed and `sdb` is
		array. The following procedure assumes that `sda` failed and `sdb` is
		good in a RAID-1 array, but can vary with other RAID configurations or
		drive models.

		@@ -35,6 +35,12 @@ with SMART output](https://wiki.hetzner.de/index.php/Seriennummern_von_Festplatt

		# Hardware RAID

		Note: we do not have hardware RAID servers, nor do we want any in the
		future.

		This documentation is kept only for historical reference, in case we
		end up with hardware RAID arrays again.

		## MegaCLI operation

		Some TPO machines, particularly [at cymru](howto/new-machine-cymru), have hardware RAID with `megaraid`
		@@ -197,7 +203,47 @@ currently not in use:
		a0e32s0 465GiB a0d0 online errs: media:0 other:819
		a0e32s1 465GiB a0d0 online errs: media:0 other:819
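
The status lines above have a regular shape, so they can be checked by a script. The following is a minimal sketch, not part of the original documentation: the field names (`drive`, `size`, `array`, `state`) are guesses at what each column means, and the pattern was written only against the two sample lines shown here, so other tool or firmware versions may need adjustments.

```python
import re
from typing import NamedTuple, Optional


class DriveStatus(NamedTuple):
    # Field names are assumptions based on the sample output only.
    drive: str
    size: str
    array: str
    state: str
    media_errors: int
    other_errors: int


# Matches lines such as:
#   a0e32s0 465GiB a0d0 online errs: media:0 other:819
LINE_RE = re.compile(
    r"^(?P<drive>\S+)\s+(?P<size>\S+)\s+(?P<array>\S+)\s+(?P<state>\S+)\s+"
    r"errs:\s+media:(?P<media>\d+)\s+other:(?P<other>\d+)$"
)


def parse_drive_line(line: str) -> Optional[DriveStatus]:
    """Parse one drive status line, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return DriveStatus(
        drive=match["drive"],
        size=match["size"],
        array=match["array"],
        state=match["state"],
        media_errors=int(match["media"]),
        other_errors=int(match["other"]),
    )


if __name__ == "__main__":
    sample = "a0e32s0 465GiB a0d0 online errs: media:0 other:819"
    status = parse_drive_line(sample)
    if status is not None and (status.media_errors or status.other_errors):
        print(f"{status.drive}: media={status.media_errors} other={status.other_errors}")
```

Returning `None` for lines that do not match lets a caller skip headers or blank lines without raising an exception.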

		## Pager playbook
		## References

		Here are some external documentation links regarding hardware RAID setups:

		* <https://cs.uwaterloo.ca/twiki/view/CF/MegaRaid>
		* <https://raid.wiki.kernel.org/index.php/Hardware_Raid_Setup_using_MegaCli>
		* <https://sysadmin.compxtreme.ro/how-to-replace-an-lsi-raid-disk-with-megacli/>
		* <https://wikitech.wikimedia.org/wiki/MegaCli>

		# SMART monitoring

		Some servers will fail to properly detect disk drives in their SMART
		configuration. In particular, `smartd` does not support:

		* virtual disks (e.g. `/dev/nbd0`)
		* MMC block devices (e.g. `/dev/mmcblk0`, commonly found on ARM
		devices)
		* out of the box, CCISS raid devices (e.g. `/dev/cciss/c0d0`)

		The latter can be configured with the following snippet in
		`/etc/smartd.conf`:

		#DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
		DEFAULT -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
		/dev/cciss/c0d0 -d cciss,0
		/dev/cciss/c0d0 -d cciss,1
		/dev/cciss/c0d0 -d cciss,2
		/dev/cciss/c0d0 -d cciss,3
		/dev/cciss/c0d0 -d cciss,4
		/dev/cciss/c0d0 -d cciss,5

		Notice how the `DEVICESCAN` is commented out to be replaced by the
		CCISS configuration. One line for each drive should be added (and no,
		it does not autodetect all drives unfortunately). This hack was
		deployed on `listera` which uses that hardware RAID.

		Other hardware RAID controllers are better supported. For example, the
		`megaraid` controller on `moly` was correctly detected by `smartd`
		which accurately found a broken hard drive.
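
Since `smartd` needs one line per drive behind a CCISS controller, the per-drive entries can be generated instead of typed by hand. This is a small sketch under stated assumptions: the helper name `cciss_smartd_lines` is hypothetical, and the drive count has to be supplied by the operator because, as noted above, the drives are not autodetected.

```python
from typing import List


def cciss_smartd_lines(device: str, drive_count: int) -> List[str]:
    """Return one smartd.conf entry per physical drive behind a CCISS controller.

    `device` is the controller's block device (e.g. /dev/cciss/c0d0) and
    `drive_count` is the number of physical drives, which must be known
    in advance.
    """
    return [f"{device} -d cciss,{index}" for index in range(drive_count)]


if __name__ == "__main__":
    # Six drives, matching the snippet above.
    for line in cciss_smartd_lines("/dev/cciss/c0d0", 6):
        print(line)
```

The output only covers the per-drive lines; the `DEFAULT` directive and the commented-out `DEVICESCAN` line still need to be present in `/etc/smartd.conf` as shown in the snippet.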

		# Pager playbook

		Prometheus should be monitoring hardware RAID on servers that support
		it. This is normally auto-detected by the Prometheus node exporter.
		@@ -205,7 +251,7 @@ it. This is normally auto-detected by the Prometheus node exporter.
		NOTE: those instructions are out of date and need to be rewritten for
		Prometheus, see [tpo/tpa/prometheus-alerts#16](https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/issues/16).

		### Failed disk
		## Failed disk

		A normal RAID-1 Nagios check output looks like this:

		@@ -219,7 +265,7 @@ It actually has the numbers backwards: in the above situation, there
		was only one degraded drive, and 3 healthy ones. See above for how
		to restore a drive in a MegaRAID array.

		### Disks with "other" errors
		## Disks with "other" errors

		The following warning may seem innocuous but actually reports that
		drives have "errors":
		@@ -259,42 +305,12 @@ safely ignored.
		[this discussion]: https://serverfault.com/questions/482705/megacli-causes-drive-other-error
		[Key Code Qualifier]: https://en.wikipedia.org/wiki/Key_Code_Qualifier

		# SMART monitoring

		Some servers will fail to properly detect disk drives in their SMART
		configuration. In particular, `smartd` does not support:

		* virtual disks (e.g. `/dev/nbd0`)
		* MMC block devices (e.g. `/dev/mmcblk0`, commonly found on ARM
		devices)
		* out of the box, CCISS raid devices (e.g. `/dev/cciss/c0d0`)

		The latter can be configured with the following snippet in
		`/etc/smartd.conf`:

		#DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
		DEFAULT -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
		/dev/cciss/c0d0 -d cciss,0
		/dev/cciss/c0d0 -d cciss,1
		/dev/cciss/c0d0 -d cciss,2
		/dev/cciss/c0d0 -d cciss,3
		/dev/cciss/c0d0 -d cciss,4
		/dev/cciss/c0d0 -d cciss,5

		Notice how the `DEVICESCAN` is commented out to be replaced by the
		CCISS configuration. One line for each drive should be added (and no,
		it does not autodetect all drives unfortunately). This hack was
		deployed on `listera` which uses that hardware RAID.
		# Other documentation

		Other hardware RAID controllers are better supported. For example, the
		`megaraid` controller on `moly` was correctly detected by `smartd`
		which accurately found a broken hard drive.

		## References

		Here are some external documentation links:
		See also:

		* <https://cs.uwaterloo.ca/twiki/view/CF/MegaRaid>
		* <https://raid.wiki.kernel.org/index.php/Hardware_Raid_Setup_using_MegaCli>
		* <https://sysadmin.compxtreme.ro/how-to-replace-an-lsi-raid-disk-with-megacli/>
		* <https://wikitech.wikimedia.org/wiki/MegaCli>
		- [LVM](howto/lvm)
		- [RAID wiki](https://archive.kernel.org/oldwiki/raid.wiki.kernel.org/) (archived)
		- [md(4) manual page](https://manpages.debian.org/bookworm/mdadm/md.4.en.html)
		- [mdadm(8) manual page](https://manpages.debian.org/bookworm/mdadm/mdadm.8.en.html)
		- [md driver kernel documentation](https://docs.kernel.org/admin-guide/md.html)