silence errors from MegaRAID arrays on chi-node-XX
In #40732 (closed), we reported that:
Since chi-node-11 was deployed we have been seeing errors like this in Nagios:
WARNING: 0:0:RAID-1:2 drives:465.25GB:Optimal Drives:2 (1750 Errors: 0 media, 0 predictive, 1750 other)
In the controller event log (megacli -AdpEventLog -GetEvents -f /dev/stdout -A0) these messages are repeated:

seqNum: 0x00005010
Time: Wed Apr 20 19:53:54 2022
Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 00(e0x20/s0) Path 1221000000000000, CDB: 4d 00 4d 00 00 00 00 00 20 00, Sense: 5/24/00
Event Data:
===========
Device ID: 0
Enclosure Index: 32
Slot Number: 0
CDB Length: 10
CDB Data: 004d 0000 004d 0000 0000 0000 0000 0000 0020 0000 0000 0000 0000 0000 0000 0000
Sense Length: 18
Sense Data: 0070 0000 0005 0000 0000 0000 0000 000a 0000 0000 0000 0000 0024 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
Those errors were ultimately determined to be harmless, but we still have recurring alerts in Nagios about this problem.
While we can ACK those errors, in the long run it would be better if the monitoring system could tell real errors from harmless ones. That could mean ignoring "other" errors like those, or refining how errors are grouped so we can better judge whether one is worth alerting on.
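
To make the "ignore other errors" option concrete, here is a rough sketch of how a wrapper around the check output could reclassify the status line. It is only an illustration: the regular expression is modeled on the single sample line quoted above, the function name classify is made up, and the policy (alert only on media or predictive errors, or when the array is not reported as Optimal) is an assumption, not what the current check does.

```python
import re

# Matches the error summary at the end of the status line, e.g.
# "(1750 Errors: 0 media, 0 predictive, 1750 other)".
# Assumption: the format always looks like the one sample in this ticket.
ERROR_SUMMARY = re.compile(
    r"\((?P<total>\d+) Errors: (?P<media>\d+) media, "
    r"(?P<predictive>\d+) predictive, (?P<other>\d+) other\)"
)


def classify(status_line: str) -> str:
    """Hypothetical helper: map a check status line to OK, WARNING or UNKNOWN."""
    match = ERROR_SUMMARY.search(status_line)
    if match is None:
        # Unparseable output should not be silently treated as healthy.
        return "UNKNOWN"
    if "Optimal" not in status_line:
        # Assumed policy: any array state other than Optimal still alerts.
        return "WARNING"
    if int(match["media"]) > 0 or int(match["predictive"]) > 0:
        return "WARNING"
    # Only "other" errors (or none at all): treated as harmless here.
    return "OK"


if __name__ == "__main__":
    sample = (
        "WARNING: 0:0:RAID-1:2 drives:465.25GB:Optimal Drives:2 "
        "(1750 Errors: 0 media, 0 predictive, 1750 other)"
    )
    print(classify(sample))
```

The downside of this approach is that it hides every "other" error, including ones that might matter, so refining the grouping may still be the better long-term fix.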