audit last year's nagios notifications for proper coverage in Prometheus
Compare Nagios alerts with Prometheus alerts to make sure we are not missing anything (see also #41713 (closed) for another audit).
These are the alerts that fired in the year before this issue was filed, in decreasing number of occurrences:
- apt - security updates (#41633 (closed), missing timestamp in #41355 (closed) and obsolete #41712)
- needrestart (#41633 (closed))
- uptime check (rejected in TPA-RFC-33)
- load (planned in phase B, #41639)
- disk usage (#41633 (closed))
- system - all services running (planned in phase B, #41639)
- network service, mostly covered by the blackbox exporter (#41632 (closed)), that is:
  - onionoo varnish (delegated in TPA-RFC-33, to confirm)
  - onionoo full service check (delegated in TPA-RFC-33, to confirm)
  - onionoo backend (delegated in TPA-RFC-33, to confirm)
  - nrpe (N/A, equivalent to JobDown check now, #41633 (closed))
  - https (#41632 (closed))
  - ntp peer (planned in phase B, #41639)
  - ntp time (planned in phase B, #41639)
  - https cert (#41632 (closed), #41633 (closed), also postponed to phase B in #41731 (closed))
  - http (#41632 (closed))
  - sshd (#41632 (closed))
  - smtp (#41632 (closed))
- PING (planned in phase B, #41639)
- puppet - catalog run (planned in phase B, #41639)
- is DOWN ** (#41633 (closed), #41632 (closed))
- RAID - DRBD (#41633 (closed))
- process (rejected in TPA-RFC-33)
- setup - ud-ldap freshness (planned in phase B, #41639)
- swap usage - percent (planned in phase B, #41639)
- backup (#41633 (closed))
- puppet - all catalog runs (planned in phase B, #41639)
- application service - bridgedb status (delegated in TPA-RFC-33, covered by anticensorship)
- DNS SOA sync (planned in phase B, #41639)
- unwanted (rejected in TPA-RFC-33)
- collector2 (delegated in TPA-RFC-33, to confirm)
- mirror static sync (planned in phase B, #41639)
- DNS (planned in phase B, #41794)
  - DNS SOA sync
  - DNS - DS expiry
  - DNS - delegation and signature expiry
  - DNS - security delegations
  - DNS - zones signed properly
- processes - total (rejected in TPA-RFC-33)
- Ganeti - cluster (planned in phase B, #41639)
- SSL cert - db.torproject.org (planned in phase B, #41732 (closed))
- collector (delegated in TPA-RFC-33, to confirm)
- Ganeti - instances (planned in phase B, #41639)
- system - filesystem check (planned in phase B, #41639)
- swap usage - mb (planned in phase B, #41639)
- unbound trust anchors (planned in phase B, #41639)
- mail queue (delegated in TPA-RFC-33, rejected by anticensorship)
- CPU - intel ucode (#41633 (closed))
- postgresql backups (#41633 (closed), #41774 (closed))
- onionoo - ping backend onionoo-backend-02 (delegated in TPA-RFC-33, to confirm)
- RAID - sw raid (#41633 (closed))
- processes - zombies (rejected in TPA-RFC-33)
- RAID - megaraid SAS (N/A, host retirement)
- network - v6 gw (rejected in TPA-RFC-33)
- users (rejected in TPA-RFC-33)
- SSL cert - host (postponed to phase B, #41732 (closed))
- redis liveness on crm-int-01 from crm-ext-01 (#41633 (closed))
- RAID - unexpected sw raid (rejected in TPA-RFC-33)
- bridges.tpo web service (delegated in TPA-RFC-33, covered by anticensorship)
- SSL certs - LE (#41633 (closed), also postponed to phase B in #41731 (closed))
- mirror sync (planned in phase B, #41639)
- onionoo - ping backend onionoo-backend-01 (delegated in TPA-RFC-33, to confirm)
- SAN health status (N/A, host retirement)
- translation cron - stuck (removed before retirement)
- icmp probe - RAID controller module 1 port 2 iSCSI interface (N/A, host retirement)
- icmp probe - RAID controller module 0 management interface (N/A, host retirement)
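
A ranking like the one above can be produced by tallying notification lines from the Nagios log. The following is a minimal sketch, not the script used for this audit. It assumes log lines of the form `[epoch] SERVICE NOTIFICATION: contact;host;service;state;command;output`, which should be checked against the actual log. The names `count_notifications` and `SAMPLE`, and the hosts and contacts in the sample, are hypothetical. Skipping `OK` recoveries is a choice made here so that only firing alerts are counted.

```python
import re
from collections import Counter

# Assumed line shape (verify against your own Nagios log):
# [epoch] SERVICE NOTIFICATION: contact;host;service;state;command;output
LINE_RE = re.compile(
    r"^\[(\d+)\] SERVICE NOTIFICATION: ([^;]*);([^;]*);([^;]*);([^;]*);"
)


def count_notifications(lines, since=0):
    """Return (service, count) pairs, most frequent first.

    Only service notifications with a timestamp >= `since` (epoch seconds)
    and a state other than OK are counted.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            # Not a service notification (e.g. a host notification): skip.
            continue
        timestamp, _contact, _host, service, state = match.groups()
        if int(timestamp) < since or state == "OK":
            continue
        counts[service] += 1
    return counts.most_common()


# Hypothetical sample data, for illustration only.
SAMPLE = [
    "[1700000000] SERVICE NOTIFICATION: admin;web-01;disk usage;WARNING;notify-by-email;DISK WARNING",
    "[1700000100] SERVICE NOTIFICATION: admin;web-02;disk usage;CRITICAL;notify-by-email;DISK CRITICAL",
    "[1700000200] SERVICE NOTIFICATION: admin;web-01;load;WARNING;notify-by-email;load high",
    "[1700000300] SERVICE NOTIFICATION: admin;web-01;load;OK;notify-by-email;load ok",
    "[1700000400] HOST NOTIFICATION: admin;web-03;DOWN;notify-host-by-email;unreachable",
]

if __name__ == "__main__":
    for service, count in count_notifications(SAMPLE):
        print(f"{count:5d}  {service}")
```

Each resulting service name can then be checked against the Prometheus alerting rules and marked as covered, planned, delegated, or rejected, as in the list above.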
Edited by anarcat