Audit last year's Nagios notifications for proper coverage in Prometheus

Compare Nagios alerts with Prometheus to make sure we are not missing anything (see also #41713 (closed) for another audit).

These are the alerts that fired in the year leading up to the filing of this issue, in decreasing number of occurrences:

  • apt - security updates (#41633 (closed), missing timestamp in #41355 (closed) and obsolete #41712 (closed))
  • needrestart (#41633 (closed))
  • uptime check (rejected in TPA-RFC-33)
  • load (planned in phase B, #41639 (closed))
  • disk usage (#41633 (closed))
  • system - all services running (planned in phase B, #41639 (closed))
  • network service, mostly covered by the blackbox exporter (#41632 (closed)), that is:
  • PING (planned in phase B, #41639 (closed))
  • puppet - catalog run (planned in phase B, #41639 (closed))
  • is DOWN ** (#41633 (closed), #41632 (closed))
  • RAID - DRBD (#41633 (closed))
  • process (rejected in TPA-RFC-33)
  • setup - ud-ldap freshness (planned in phase B, #41639 (closed))
  • swap usage - percent (planned in phase B, #41639 (closed))
  • backup (#41633 (closed))
  • puppet - all catalog runs (planned in phase B, #41639 (closed))
  • application service - bridgedb status (delegated in TPA-RFC-33, covered by anticensorship)
  • DNS SOA sync (planned in phase B, #41639 (closed))
  • unwanted (rejected in TPA-RFC-33)
  • collector2 (delegated in TPA-RFC-33, to confirm)
  • mirror static sync (planned in phase B, #41639 (closed))
  • DNS (planned in phase B, #41794)
    • DNS SOA sync
    • DNS - DS expiry
    • DNS - delegation and signature expiry
    • DNS - security delegations
    • DNS - zones signed properly
  • processes - total (rejected in TPA-RFC-33)
  • Ganeti - cluster (planned in phase B, #41639 (closed))
  • SSL cert - db.torproject.org (planned in phase B, #41732 (closed))
  • collector (delegated in TPA-RFC-33, to confirm)
  • Ganeti - instances (planned in phase B, #41639 (closed))
  • system - filesystem check (planned in phase B, #41639 (closed))
  • swap usage - mb (planned in phase B, #41639 (closed))
  • unbound trust anchors (planned in phase B, #41639 (closed))
  • mail queue (delegated in TPA-RFC-33, rejected by anticensorship)
  • CPU - intel ucode (#41633 (closed))
  • postgresql backups (#41633 (closed), #41774 (closed))
  • onionoo - ping backend onionoo-backend-02 (delegated in TPA-RFC-33, to confirm)
  • RAID - sw raid (#41633 (closed))
  • processes - zombies (rejected in TPA-RFC-33)
  • RAID - megaraid SAS (N/A, host retirement)
  • network - v6 gw (rejected in TPA-RFC-33)
  • users (rejected in TPA-RFC-33)
  • SSL cert - host (postponed to phase B, #41732 (closed))
  • redis liveness on crm-int-01 from crm-ext-01 (#41633 (closed))
  • RAID - unexpected sw raid (rejected in TPA-RFC-33)
  • bridges.tpo web service (delegated in TPA-RFC-33, covered by anticensorship)
  • SSL certs - LE (#41633 (closed), also postponed to phase B in #41731 (closed))
  • mirror sync (planned in phase B, #41639 (closed))
  • onionoo - ping backend onionoo-backend-01 (delegated in TPA-RFC-33, to confirm)
  • SAN health status (N/A, host retirement)
  • translation cron - stuck (removed before retirement)
  • icmp probe - RAID controller module 1 port 2 iSCSI interface (N/A, host retirement)
  • icmp probe - RAID controller module 0 management interface (N/A, host retirement)
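
The issue does not say how the ranking above was produced. As a rough illustration only, a list of alerts "in decreasing number of occurrences" could be tallied along these lines, assuming the Nagios service names have already been extracted from the notifications, one entry per notification. The `rank_alerts` helper and the sample data are hypothetical.

```python
from collections import Counter


def rank_alerts(service_names):
    """Return (name, count) pairs sorted by decreasing count.

    `service_names` is assumed to hold one Nagios service name per
    notification; how those names are extracted is out of scope here.
    """
    # Ignore blank entries and normalize surrounding whitespace.
    counts = Counter(name.strip() for name in service_names if name.strip())
    # most_common() sorts by count, highest first.
    return counts.most_common()


# Made-up sample data, for illustration only.
sample = [
    "disk usage",
    "load",
    "disk usage",
    "apt - security updates",
    "disk usage",
    "load",
]

for name, count in rank_alerts(sample):
    print(f"{count:4d}  {name}")
```

Each ranked name can then be checked by hand against the Prometheus alerting rules, as the list above does.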
Edited by anarcat