... | ... | @@ -35,6 +35,8 @@ with SMART output](https://wiki.hetzner.de/index.php/Seriennummern_von_Festplatt |
|
|
|
|
|
# Hardware RAID
|
|
|
|
|
|
## MegaCLI operation
|
|
|
|
|
|
Some TPO machines -- particularly [at cymru](howto/new-machine-cymru) --
have hardware RAID with `megaraid` controllers. Those are controlled
with the `MegaCLI` command that is ... rather hard to use.
|
... | ... | @@ -148,6 +150,53 @@ To follow progress: |
|
|
|
|
|
watch /opt/MegaRAID/MegaCli/MegaCli64 -PDRbld -ShowProg -PhysDrv[252:0] -a0
|
|
|
|
|
|
### Rebuilding the Debian package
|
|
|
|
|
|
The Debian package is based on a binary RPM provided by upstream ([LSI
Corporation](https://en.wikipedia.org/wiki/LSI_Corporation)). Unfortunately, upstream was acquired by
[Broadcom](https://en.wikipedia.org/wiki/Broadcom_Inc.) (then named Avago Technologies) in 2014, after which
development of the MegaCLI software seems to have stopped. Since then
the `lsi.com` domain redirects to `broadcom.com` and those packages --
which were already hard to find -- are getting even harder to find.
|
|
|
|
|
|
The [Broadcom search page](https://www.broadcom.com/support/download-search?pg=&pf=&pn=&pa=&po=&dk=megacli&pl=) seems to be the best place to find
MegaRAID downloads. That link should show search results, with a link
to "MegaCLI" under "Management Software and Tools". The latest release
is currently (as of 2021) 5.5 P2 (dated 2014-01-19!). Note that this
release number differs from the version number of the `megacli` binary
itself (8.07.14). A direct link to the package is currently:
|
|
|
|
|
|
https://docs.broadcom.com/docs-and-downloads/raid-controllers/raid-controllers-common-files/8-07-14_MegaCLI.zip
|
|
|
|
|
|
Upstream does not seem to mind breaking those links at any time, so
you might have to redo the search to find it. In any case, the package
is based on an RPM buried in the ZIP file, so this should get you a
package:
|
|
|
|
|
|
    unzip 8-07-14_MegaCLI.zip
    fakeroot alien Linux/MegaCli-8.07.14-1.noarch.rpm
|
|
|
|
|
|
This gives you a `megacli_8.07.14-2_all.deb` package which normally
gets uploaded to the proprietary archive on `alberti`.
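
The steps above can be wrapped in a small helper. This is only a
sketch, not a script known to be in use: the function names
`deb_name_for_rpm` and `build_megacli_deb` are hypothetical, and the
name mapping simply mirrors the two filenames shown above (lowercased
name, release bumped by one, `noarch` becoming `all`), which is an
assumption about the defaults of `alien`.

```shell
# Hypothetical helper: predict the .deb filename for a given RPM filename.
# The mapping is inferred from the filenames in this page (assumption).
deb_name_for_rpm() {
    base=$(basename "$1" .rpm)      # e.g. MegaCli-8.07.14-1.noarch
    arch=${base##*.}                # noarch
    base=${base%.*}                 # MegaCli-8.07.14-1
    release=${base##*-}             # 1
    base=${base%-*}                 # MegaCli-8.07.14
    version=${base##*-}             # 8.07.14
    name=$(printf '%s' "${base%-*}" | tr '[:upper:]' '[:lower:]')
    if [ "$arch" = noarch ]; then
        arch=all
    fi
    printf '%s_%s-%s_%s.deb\n' "$name" "$version" "$((release + 1))" "$arch"
}

# Hypothetical wrapper around the two commands documented above.
# $1 is the downloaded ZIP file; the RPM path inside it is the one
# shown above and may change if upstream repackages the archive.
build_megacli_deb() {
    rpm=Linux/MegaCli-8.07.14-1.noarch.rpm
    unzip "$1" && fakeroot alien "$rpm" && ls -l "$(deb_name_for_rpm "$rpm")"
}
```

Running `build_megacli_deb 8-07-14_MegaCLI.zip` in an empty directory
should leave the `.deb` next to the extracted files, provided `unzip`,
`fakeroot` and `alien` are installed.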
|
|
|
|
|
|
An alternative is to use existing packages like the ones from
[le-vert.net](https://hwraid.le-vert.net/wiki/DebianPackages). In particular, `megactl` is a free software
alternative that works on `chi-node-13`, but it is not packaged in
Debian, so it is currently not in use:
|
|
|
|
|
|
    root@chi-node-13:~# megasasctl
    a0 PERC 6/i Integrated encl:1 ldrv:1 batt:good
    a0d0 465GiB RAID 1 1x2 optimal
    a0e32s0 465GiB a0d0 online errs: media:0 other:819
    a0e32s1 465GiB a0d0 online errs: media:0 other:819
|
|
|
|
|
|
|
|
|
## Pager playbook
|
|
|
|
|
|
Nagios should be monitoring hardware RAID on servers that support
|
... | ... | @@ -155,6 +204,8 @@ it. This is normally auto-detected by Puppet (in the `raid` |
|
|
module/class) but grep around for `megaraid` otherwise. The `raid`
|
|
|
module should have a good README file describing how it works.
|
|
|
|
|
|
### Failed disk
|
|
|
|
|
|
A normal RAID-1 Nagios check output looks like this:
|
|
|
|
|
|
OK: 0:0:RAID-1:2 drives:465.25GB:Optimal Drives:2
|
... | ... | @@ -167,6 +218,46 @@ It actually has the numbers backwards: in the above situation, there |
|
|
was only *one* degraded drive, and 3 healthy ones. See above for how
|
|
|
to restore a drive in a MegaRAID array.
|
|
|
|
|
|
### Disks with "other" errors
|
|
|
|
|
|
The following warning may seem innocuous but actually reports that
drives have "other" errors:
|
|
|
|
|
|
WARNING: 0:0:RAID-1:2 drives:465.25GB:Optimal Drives:2 (1530 Errors: 0 media, 0 predictive, 1530 other)
|
|
|
|
|
|
The `1530 Errors` part is the key here. They are "other" errors. This
|
|
|
can be reproduced with the `megacli` command:
|
|
|
|
|
|
    # megacli -PDList -aALL | grep -e '^Enclosure Device' -e '^Slot' -e '^Firmware' -e "Error Count"
    Enclosure Device ID: 32
    Slot Number: 0
    Media Error Count: 0
    Other Error Count: 765
    Firmware state: Online, Spun Up
    Enclosure Device ID: 32
    Slot Number: 1
    Media Error Count: 0
    Other Error Count: 765
    Firmware state: Online, Spun Up
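
In the output above, the two drives report 765 "other" errors each,
which adds up to the 1530 in the Nagios warning. To check that sum on
another machine, the per-drive counters can be totalled. This is a
minimal sketch, assuming the `-PDList` output format shown above; the
function name `sum_other_errors` is hypothetical:

```shell
# Sum every "Other Error Count" line of `megacli -PDList -aALL`
# output read from standard input, and print the total.
sum_other_errors() {
    awk -F': *' '/^Other Error Count/ { total += $2 } END { print total + 0 }'
}

# Usage on a live system (needs root and the megacli binary):
#   megacli -PDList -aALL | sum_other_errors
```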
|
|
|
|
|
|
The actual error should also be visible in the logs:
|
|
|
|
|
|
megacli -AdpEventLog -GetLatest 100 -f events.log -aALL
|
|
|
|
|
|
... then in `events.log`, the key part is:
|
|
|
|
|
|
Event Description: Unexpected sense: PD 00(e0x20/s0) Path 1221000000000000, CDB: 4d 00 4d 00 00 00 00 00 20 00, Sense: 5/24/00
|
|
|
|
|
|
The `Sense` field is a [Key Code Qualifier][] ("an error-code returned
by a SCSI device") which, for 5/24/00, means "Illegal Request - invalid
field in CDB (Command Descriptor Block)". According to [this
discussion][] it seems that *newer* versions of the `megacli` binary
trigger those errors when older drives are in use. Those errors can be
safely ignored.
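
Before ignoring the errors, it can be worth confirming that every
event in the log carries the same 5/24/00 code. This is a minimal
sketch, assuming the `Sense: key/ASC/ASCQ` layout shown in the event
above; the function name `count_sense_codes` is hypothetical:

```shell
# Count how often each "Sense: key/asc/ascq" triple appears in a
# MegaCLI event log file ($1), most frequent first.
count_sense_codes() {
    grep -o 'Sense: [0-9A-Fa-f]*/[0-9A-Fa-f]*/[0-9A-Fa-f]*' "$1" |
        sort | uniq -c | sort -rn
}

# Usage, after dumping the log as described above:
#   count_sense_codes events.log
```

Any triple other than 5/24/00 in the result is not covered by the
explanation above and should be looked up separately.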
|
|
|
|
|
|
[this discussion]: https://serverfault.com/questions/482705/megacli-causes-drive-other-error
|
|
|
[Key Code Qualifier]: https://en.wikipedia.org/wiki/Key_Code_Qualifier
|
|
|
|
|
|
# SMART monitoring
|
|
|
|
|
|
Some servers will fail to properly detect disk drives in their SMART
|
... | ... | |