priority A metrics and alerts deployment
Quote from TPA-RFC-33:
We assign each Icinga check an exporter and a priority:
- A: must have, should be completed before Icinga is shutdown, as soon as possible
- B: should have, would ideally be done before Icinga is shutdown, but we can live without it for a while
- C: nice to have, we can live without it
- D: drop, we wouldn't even keep checking this in Icinga if we kept it
- E: what on earth is this thing and how do we deal with it, to review
In the appendix, the Icinga checks inventory lists every Icinga check and what should happen with it.
Summary:
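
| Kind | Checks | Counts under priorities A to E (non-empty cells, in order) | Exporters |
|------|--------|------------------------------------------------------------|-----------|
| existing | 8 | 4, 4 | 1 |
| missing, existing exporter | 8 | 5, 3 | 3 |
| missing, new exporters | 8 | 4, 4 | 8 |
| DNS | 7 | 1, 6 | 3? |
| To investigate | 4 | 2, 1, 1 | 1 existing, 2 new? |
| dropped | 8 | 8 | 0 |
| delegated to service admins | 4 | 4 | 4? |
| new exporters | 0 | | 14 (priority C) |

Checks by alerting levels: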
- warning: 31
- critical: 3
- dropped: 12
Priority A checks are actually:
Some of those are actually already scraped metrics and "just" need alerts to be defined. Those alerts will likely be defined in Puppet.
Checklist of Icinga alerts to cover, priority A (along with TPA-RFC-33's guess at possible equivalent Prometheus metrics or exporters):
- `check_disk` (`node_filesystem_avail_bytes`)
- `check_nrpe` (`up`)
- `dsa-check-drbd` (`node_drbd_out_of_sync_bytes`, `node_drbd_connected`, note: DRBD 9 not supported, alternatives: ha_cluster_exporter, drbd-reactor)
- `dsa-check-raid-sw` (`node_md_disks` / `node_md_state`, note see also this post)
- `needrestart -p` (`kernel_status`, `microcode_status`, note, needs development: not supported upstream, alternative implementation lacking)
- `check_ssh --timeout=40` (`probe_success` in blackbox, possibly covered by #41632)
- `check_smtp` (`probe_success`, possibly covered by #41632, also need end-to-end deliverability checks, to be done in a further phase)
- `check_http` (`probe_success`, `probe_duration_seconds`, critical only for key sites, after significant delay, see also #40568)
- `check_https` (idem)
- `dsa-check-cert-expire` (cert-exporter, checks local CA for expiry, on disk, `/etc/ssl/certs/thishost.pem` and `db.torproject.org.pem` on each host)
- `dsa_check_cert` (cert-exporter, check for cert expiry for all sites, the above will check for real user-visible failures, this is about "pending renewal failed", nagios checks for 14 days, see also #41385)
- "redis liveness" (blackbox, checks that the Redis tunnel works, might require blackbox exporter, possibly better served by end-to-end donation testing?)
- `dsa-check-backuppg` (barman-exporter, tricky dependency on barman rebuild, maybe builtin?)
- `dsa-check-bacula` (bacula-exporter, see also WMF's check_bacula.py)
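
As a rough illustration of what "just" defining alerts on already-scraped metrics could look like, here is a minimal sketch of Prometheus alerting rules for three of the items above, using metrics named in the checklist (`up`, `node_filesystem_avail_bytes`, `probe_success`) plus the node exporter's `node_filesystem_size_bytes`. The group name, alert names, thresholds, `for` durations and severity labels are all assumptions made for this example, not values taken from TPA-RFC-33, and the real rules would presumably be deployed through Puppet.

```yaml
groups:
  - name: priority-a-examples        # hypothetical group name
    rules:
      # Rough check_nrpe equivalent: the scrape target stopped answering.
      - alert: HostDown              # hypothetical alert name
        expr: up == 0
        for: 5m                      # assumed delay before firing
        labels:
          severity: warning          # assumed level

      # Rough check_disk equivalent: less than 10% of the filesystem is
      # available (assumed threshold; pseudo-filesystems are not filtered here).
      - alert: DiskSpaceLow          # hypothetical alert name
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
        for: 15m                     # assumed delay before firing
        labels:
          severity: warning          # assumed level

      # Rough check_ssh / check_smtp / check_http equivalent: a blackbox
      # probe has been failing for a while.
      - alert: ProbeFailed           # hypothetical alert name
        expr: probe_success == 0
        for: 10m                     # assumed delay before firing
        labels:
          severity: warning          # assumed level
```

A file like this can be checked for syntax errors with `promtool check rules <file>` before it is deployed.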
The "redis liveness" check is particularly tricky to implement; here is the magic configuration as it stands right now:

    -
      name: "redis liveness"
      nrpe: "if echo PING | nc -w 1 localhost 6379 | grep -m 1 -q +PONG; then echo 'OK: redis seems to be alive.'; else echo 'CRITICAL: Did not get a PONG from redis.'; exit 2; fi"
      hosts: crm-int-01
    -
      name: "redis liveness on crm-int-01 from crm-ext-01"
      nrpe: "if echo PING | nc -w 1 crm-int-01-priv 6379 | grep -m 1 -q +PONG; then echo 'OK: redis seems to be alive.'; else echo 'CRITICAL: Did not get a PONG from redis.'; exit 2; fi"
      hosts: crm-ext-01

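
One possible translation of these NRPE one-liners is a TCP module in the blackbox exporter, whose `query_response` option sends a line and matches the reply against a regular expression. The sketch below is only a guess at how that could look: the module name `redis_ping` is made up for this example, and it assumes a blackbox exporter that can actually reach port 6379 on the target. For the second check, the probe would have to originate from `crm-ext-01` to test the tunnel, which is the tricky part.

```yaml
modules:
  redis_ping:                 # hypothetical module name
    prober: tcp
    timeout: 1s               # mirrors `nc -w 1`
    tcp:
      query_response:
        # Redis accepts inline commands; the exporter sends this as one line.
        - send: "PING"
        # The probe succeeds only if a reply line matches this regex.
        - expect: "^\\+PONG"
```

The resulting `probe_success` metric for a target such as `crm-int-01-priv:6379` could then be alerted on like any other blackbox probe.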
Check a box only when (a) we are sure the metrics are being scraped and are up to date in Prometheus and (b) alerts are triggered when an error condition occurs (consider triggering such an error condition to confirm).
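
One way to exercise condition (b) without breaking a real host is a rule unit test run with `promtool test rules`. The sketch below assumes a hypothetical rule file `alerts.yml` that defines an alert named `HostDown` with the expression `up == 0`, `for: 5m`, a `severity: warning` label and no annotations. Those names and values, as well as the test target, are illustrative only.

```yaml
rule_files:
  - alerts.yml                # hypothetical rule file containing HostDown

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Simulate a target that stays down: the value 0, repeated for 10 more samples.
      - series: 'up{job="node", instance="test.example.org:9100"}'
        values: '0+0x10'
    alert_rule_test:
      # After 10 minutes the 5-minute `for` delay has elapsed, so the
      # alert is expected to be firing.
      - eval_time: 10m
        alertname: HostDown
        exp_alerts:
          - exp_labels:
              severity: warning
              job: node
              instance: test.example.org:9100
```

This only shows that the rule fires on synthetic data. It does not replace checking that the real metric is present and fresh in Prometheus, which is condition (a).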
A more detailed table is available at https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-33-monitoring#priority-a-1