priority A metrics and alerts deployment

Quote from TPA-RFC-33:

We assign each Icinga check an exporter and a priority:

  • A: must have, should be completed before Icinga is shut down, as soon as possible
  • B: should have, would ideally be done before Icinga is shut down, but we can live without it for a while
  • C: nice to have, we can live without it
  • D: drop, we wouldn't even keep checking this in Icinga if we kept it
  • E: what on earth is this thing and how do we deal with it, to review

In the appendix, the Icinga checks inventory lists every Icinga check and what should happen with it.

Summary:

Kind Checks A B C D E Exporters
existing 8 4 4 1
missing, existing exporter 8 5 3 3
missing, new exporters 8 4 4 8
DNS 7 1 6 3?
To investigate 4 2 1 1 1 existing, 2 new?
dropped 8 8 0
delegated to service admins 4 4 4?
new exporters 0 14 (priority C)

Checks by alerting levels:

  • warning: 31
  • critical: 3
  • dropped: 12

Priority A checks are actually:

  • node exporter: up, disk usage, RAID, DRBD, APT updates
  • blackbox: SSH, SMTP, HTTP(S) latency checks (see #40568 (closed) for latency), Redis liveness
  • textfile: needrestart
  • cert exporter: cert expiry for private CA and LE certs, see also #41385 for alternatives
  • barman exporter: PostgreSQL backups validity
  • bacula exporter

Some of those are actually already scraped metrics and "just" need alerts to be defined. Those alerts will likely be defined in Puppet.

Checklist of Icinga alerts to cover, priority A (along with TPA-RFC-33's guess at possible Prometheus equivalent metrics or exporters):

The "redis liveness" check is particularly tricky to implement. Here is the current Icinga (NRPE) configuration:

  -
    name: "redis liveness"
    nrpe: "if echo PING | nc -w 1 localhost 6379 | grep -m 1 -q +PONG; then echo 'OK: redis seems to be alive.'; else echo 'CRITICAL: Did not get a PONG from redis.'; exit 2; fi"
    hosts: crm-int-01

  -
    name: "redis liveness on crm-int-01 from crm-ext-01"
    nrpe: "if echo PING | nc -w 1 crm-int-01-priv 6379 | grep -m 1 -q +PONG; then echo 'OK: redis seems to be alive.'; else echo 'CRITICAL: Did not get a PONG from redis.'; exit 2; fi"
    hosts: crm-ext-01
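
For comparison, the same probe can be sketched in Python. This is only an illustration, not the planned implementation: the function names, the `redis_ping_success` metric name and the idea of exposing the result in the Prometheus text format (for example through the node exporter's textfile collector) are assumptions. Only the PING/+PONG exchange, port 6379 and the one-second timeout come from the NRPE commands above.

```python
import socket


def redis_ping(host: str, port: int = 6379, timeout: float = 1.0) -> bool:
    """Send an inline PING and report whether the reply starts with +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            reply = conn.recv(64)
    except OSError:
        # Connection refused, timeout or name resolution failure:
        # all of these count as "Redis is not alive".
        return False
    return reply.startswith(b"+PONG")


def textfile_metric(host: str, alive: bool) -> str:
    """Format the result as one line of Prometheus text exposition format.

    The metric name is hypothetical.
    """
    return f'redis_ping_success{{target="{host}"}} {int(alive)}\n'
```

Running `textfile_metric("crm-int-01-priv", redis_ping("crm-int-01-priv"))` on crm-ext-01 would mirror the second check. The probe still has to run on a host that can reach the private address, which is part of what makes this check awkward to move.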

Check a box only when (a) we are sure the metrics are being scraped and are up to date in Prometheus and (b) alerts fire when an error condition occurs (consider triggering such an error condition to confirm).
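
Condition (a) can be checked mechanically. The sketch below assumes the decoded JSON of a Prometheus HTTP API instant query (`/api/v1/query`) of the form `timestamp(<metric>)`, where each result carries the evaluation time and, as its value, the time of the last sample. The function name and the two-minute threshold are illustrative assumptions, not part of TPA's tooling.

```python
def check_freshness(response: dict, max_age: float = 120.0) -> tuple:
    """Return (ok, stale) for an instant query of the form timestamp(<metric>).

    `stale` lists the label sets whose last sample is older than `max_age`
    seconds at evaluation time. `ok` is True only if at least one series
    was returned and none of them is stale.
    """
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    result = response["data"]["result"]
    stale = []
    for series in result:
        # value is [evaluation time, "<timestamp of the last sample>"]
        evaluated_at, last_sample = series["value"]
        if float(evaluated_at) - float(last_sample) > max_age:
            stale.append(series["metric"])
    # An empty result means no series matched at all, which also fails (a).
    return (bool(result) and not stale, stale)
```

An empty result is treated as a failure because a series that stopped being scraped eventually drops out of instant query results instead of showing up with an old timestamp.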

A more detailed table is available at https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-33-monitoring#priority-a-1

Edited by lelutin