document latest hardware raid troubles, authored by anarcat
@@ -35,6 +35,8 @@ with SMART output](https://wiki.hetzner.de/index.php/Seriennummern_von_Festplatt
# Hardware RAID

## MegaCLI operation

Some TPO machines -- particularly [at cymru](howto/new-machine-cymru) -- have hardware RAID with `megaraid`
controllers. Those are controlled with the `MegaCLI` command that is
rather hard to use.
@@ -148,6 +150,53 @@ To follow progress:

    watch /opt/MegaRAID/MegaCli/MegaCli64 -PDRbld -ShowProg -PhysDrv[252:0] -a0

### Rebuilding the Debian package

The Debian package is based on a binary RPM provided by upstream ([LSI
corporation](https://en.wikipedia.org/wiki/LSI_Corporation)). Unfortunately, upstream was acquired in 2014 by
Avago Technologies, which later became [Broadcom](https://en.wikipedia.org/wiki/Broadcom_Inc.), after which MegaCLI
development seems to have stopped. Since then, the `lsi.com` domain redirects to
`broadcom.com` and those packages -- which were already hard to find --
are getting even harder to find.

The [broadcom search page](https://www.broadcom.com/support/download-search?pg=&pf=&pn=&pa=&po=&dk=megacli&pl=) seems to be the best place to find the
megaraid software. That link should show "search results", and under
"Management Software and Tools" there should be a link to some
"MegaCLI". The latest is currently (as of 2021) 5.5 P2 (dated
2014-01-19!). Note that this version number differs from the actual
version number of the megacli binary (8.07.14). A direct link to the
package is currently:

    https://docs.broadcom.com/docs-and-downloads/raid-controllers/raid-controllers-common-files/8-07-14_MegaCLI.zip

Upstream does not seem to mind breaking those links at
any time, so you might have to redo the search to find it. In any
case, the package is based on an RPM buried in the ZIP file, so this
should get you a package:

    unzip 8-07-14_MegaCLI.zip
    fakeroot alien Linux/MegaCli-8.07.14-1.noarch.rpm

This gives you a `megacli_8.07.14-2_all.deb` package which normally
gets uploaded to the proprietary archive on `alberti`.
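
The jump from `-1` in the RPM name to `-2` in the `.deb` name is expected: by default, `alien` adds one to the release number of the packages it converts (its `--keep-version` flag disables that). As a rough sketch, the following Python mirrors that renaming so the resulting filename can be predicted. `expected_deb_name` is a hypothetical helper written for this page, not part of `alien`, and it only handles the `noarch` to `all` architecture mapping seen above.

```python
import os
import re

# Pattern for "<name>-<version>-<release>.<arch>.rpm" filenames.
RPM_NAME = re.compile(
    r"(?P<name>.+)-(?P<version>[^-]+)-(?P<release>\d+)\.(?P<arch>[^.]+)\.rpm"
)


def expected_deb_name(rpm_path, keep_version=False):
    """Guess the .deb filename alien would produce for an RPM path.

    Assumption: only the noarch -> all mapping is handled; other
    architectures are passed through unchanged.
    """
    match = RPM_NAME.fullmatch(os.path.basename(rpm_path))
    if match is None:
        raise ValueError(f"unrecognized RPM filename: {rpm_path}")
    release = int(match["release"])
    if not keep_version:
        # Default alien behaviour: bump the release number by one.
        release += 1
    arch = "all" if match["arch"] == "noarch" else match["arch"]
    return f"{match['name'].lower()}_{match['version']}-{release}_{arch}.deb"


if __name__ == "__main__":
    print(expected_deb_name("Linux/MegaCli-8.07.14-1.noarch.rpm"))
```

With the RPM path used above, this yields the `megacli_8.07.14-2_all.deb` name mentioned in the text.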

An alternative is to use existing packages like the ones from
[le-vert.net](https://hwraid.le-vert.net/wiki/DebianPackages). In particular, `megactl` is a free software
alternative that works on `chi-node-13`, but it is not packaged in Debian and
is therefore not currently in use:

    root@chi-node-13:~# megasasctl
    a0 PERC 6/i Integrated encl:1 ldrv:1 batt:good
    a0d0 465GiB RAID 1 1x2 optimal
    a0e32s0 465GiB a0d0 online errs: media:0 other:819
    a0e32s1 465GiB a0d0 online errs: media:0 other:819
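
The per-disk error counters in the `megasasctl` output can be extracted with a few lines of Python. This is a minimal sketch: it assumes the line layout shown above, which has only been observed on `chi-node-13`. `disk_error_counts` is a hypothetical helper name, not part of `megactl`.

```python
import re

# Sample copied from the chi-node-13 output above.
SAMPLE = """\
a0 PERC 6/i Integrated encl:1 ldrv:1 batt:good
a0d0 465GiB RAID 1 1x2 optimal
a0e32s0 465GiB a0d0 online errs: media:0 other:819
a0e32s1 465GiB a0d0 online errs: media:0 other:819
"""

# Physical disk lines start with an identifier like "a0e32s0"
# (adapter, enclosure, slot) and end with the error counters.
DISK_LINE = re.compile(
    r"^(?P<disk>a\d+e\d+s\d+)\s.*\berrs:\s*media:(?P<media>\d+)\s+other:(?P<other>\d+)"
)


def disk_error_counts(output):
    """Map each physical disk to its (media, other) error counts."""
    counts = {}
    for line in output.splitlines():
        match = DISK_LINE.match(line.strip())
        if match:
            counts[match["disk"]] = (int(match["media"]), int(match["other"]))
    return counts


if __name__ == "__main__":
    for disk, (media, other) in disk_error_counts(SAMPLE).items():
        print(f"{disk}: media={media} other={other}")
```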

## Pager playbook

Nagios should be monitoring hardware RAID on servers that support
@@ -155,6 +204,8 @@ it. This is normally auto-detected by Puppet (in the `raid`
module/class) but grep around for `megaraid` otherwise. The `raid`
module should have a good README file describing how it works.

### Failed disk

A normal RAID-1 Nagios check output looks like this:

    OK: 0:0:RAID-1:2 drives:465.25GB:Optimal Drives:2

@@ -167,6 +218,46 @@ It actually has the numbers backwards: in the above situation, there
was only *one* degraded drive, and 3 healthy ones. See above for how
to restore a drive in a MegaRAID array.

### Disks with "other" errors

The following warning may seem innocuous but actually reports that
drives have "errors":

    WARNING: 0:0:RAID-1:2 drives:465.25GB:Optimal Drives:2 (1530 Errors: 0 media, 0 predictive, 1530 other)

The `1530 Errors` part is the key here: all of them are "other" errors. This
can be reproduced with the `megacli` command, which reports the counts
per drive:

    # megacli -PDList -aALL | grep -e '^Enclosure Device' -e '^Slot' -e '^Firmware' -e "Error Count"
    Enclosure Device ID: 32
    Slot Number: 0
    Media Error Count: 0
    Other Error Count: 765
    Firmware state: Online, Spun Up
    Enclosure Device ID: 32
    Slot Number: 1
    Media Error Count: 0
    Other Error Count: 765
    Firmware state: Online, Spun Up
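
The two per-drive counts add up to the figure in the Nagios warning (765 + 765 = 1530). The sketch below shows that arithmetic by parsing the filtered `megacli -PDList` output into one record per drive. It assumes each drive's block starts with an `Enclosure Device ID` line, as in the sample above. `parse_pdlist` is a hypothetical helper, not something shipped with `megacli` or the Nagios check.

```python
# Sample copied from the filtered megacli output above.
SAMPLE = """\
Enclosure Device ID: 32
Slot Number: 0
Media Error Count: 0
Other Error Count: 765
Firmware state: Online, Spun Up
Enclosure Device ID: 32
Slot Number: 1
Media Error Count: 0
Other Error Count: 765
Firmware state: Online, Spun Up
"""


def parse_pdlist(text):
    """Split "Key: value" lines into one dict per physical drive."""
    drives = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID":
            # Assumption: this key opens each drive's block.
            drives.append({})
        if drives:
            drives[-1][key] = value
    return drives


def total_errors(drives, kind):
    """Sum one error counter ("Media" or "Other") across all drives."""
    return sum(int(drive[f"{kind} Error Count"]) for drive in drives)


if __name__ == "__main__":
    drives = parse_pdlist(SAMPLE)
    print("media:", total_errors(drives, "Media"))
    print("other:", total_errors(drives, "Other"))
```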

The actual error should also be visible in the logs:

    megacli -AdpEventLog -GetLatest 100 -f events.log -aALL

... then in `events.log`, the key part is:

    Event Description: Unexpected sense: PD 00(e0x20/s0) Path 1221000000000000, CDB: 4d 00 4d 00 00 00 00 00 20 00, Sense: 5/24/00

The `Sense` field is a [Key Code Qualifier][] ("an error-code returned
by a SCSI device") which, for 5/24/00, means "Illegal Request - invalid
field in CDB (Command Descriptor Block)". According to [this
discussion][], it seems that *newer* versions of the `megacli` binary
trigger those errors when older drives are in use. Those errors can be
safely ignored.

[this discussion]: https://serverfault.com/questions/482705/megacli-causes-drive-other-error
[Key Code Qualifier]: https://en.wikipedia.org/wiki/Key_Code_Qualifier
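
To check whether other entries in `events.log` carry the same harmless code, the `Sense` triple can be pulled out of each event description and split into its three parts (sense key, additional sense code, qualifier). This is a minimal sketch: it assumes the three fields are hexadecimal, and the lookup tables only contain the single code discussed above rather than the full SCSI tables. `decode_sense` is a hypothetical helper name.

```python
import re

# Event line copied from the events.log excerpt above.
EVENT = (
    "Event Description: Unexpected sense: PD 00(e0x20/s0) "
    "Path 1221000000000000, CDB: 4d 00 4d 00 00 00 00 00 20 00, "
    "Sense: 5/24/00"
)

# Deliberately tiny tables: only the code discussed on this page.
SENSE_KEYS = {0x5: "Illegal Request"}
ADDITIONAL_SENSE = {(0x24, 0x00): "Invalid field in CDB"}

SENSE_FIELD = re.compile(r"Sense:\s*([0-9a-fA-F]+)/([0-9a-fA-F]+)/([0-9a-fA-F]+)")


def decode_sense(description):
    """Return (sense key text, additional sense text), or None if absent."""
    match = SENSE_FIELD.search(description)
    if match is None:
        return None
    # Assumption: all three fields are hexadecimal.
    key, asc, ascq = (int(group, 16) for group in match.groups())
    return (
        SENSE_KEYS.get(key, f"unknown key {key:#x}"),
        ADDITIONAL_SENSE.get((asc, ascq), f"unknown ASC/ASCQ {asc:02x}/{ascq:02x}"),
    )


if __name__ == "__main__":
    print(decode_sense(EVENT))
```

Any event that decodes to something other than "Illegal Request" / "Invalid field in CDB" is not covered by the explanation above and deserves a closer look.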

# SMART monitoring

Some servers will fail to properly detect disk drives in their SMART
...