priority A metrics and alerts deployment
Quote from TPA-RFC-33:
We assign each Icinga check an exporter and a priority:
- A: must have, should be completed before Icinga is shutdown, as soon as possible
- B: should have, would ideally be done before Icinga is shutdown, but we can live without it for a while
- C: nice to have, we can live without it
- D: drop, we wouldn't even keep checking this in Icinga if we kept it
- E: what on earth is this thing and how do we deal with it, to review
In the appendix, the Icinga checks inventory lists every Icinga check and what should happen with it.
Summary:
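| Kind                        | Checks | A | B | C | D | E | Exporters          |
|-----------------------------|--------|---|---|---|---|---|--------------------|
| existing                    | 8      | 4 | 4 |   |   |   | 1                  |
| missing, existing exporter  | 8      | 5 | 3 |   |   |   | 3                  |
| missing, new exporters      | 8      | 4 | 4 |   |   |   | 8                  |
| DNS                         | 7      | 1 | 6 |   |   |   | 3?                 |
| To investigate              | 4      | 2 | 1 |   |   | 1 | 1 existing, 2 new? |
| dropped                     | 8      |   |   |   | 8 |   | 0                  |
| delegated to service admins | 4      |   |   | 4 |   |   | 4?                 |
| new exporters               | 0      |   |   |   |   |   | 14 (priority C)    |

Checks by alerting levels: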
- warning: 31
- critical: 3
- dropped: 12
Priority A checks are actually:
- node exporter: `up`, disk usage, RAID, DRBD, APT updates
- blackbox: SSH, SMTP, HTTP(S) latency checks (see #40568 (closed) for latency), Redis liveness
- textfile: needrestart
- cert exporter: cert expiry for private CA and LE certs, see also #41385 for alternatives
- barman exporter: PostgreSQL backups validity
- bacula exporter
Some of those are actually already scraped metrics and "just" need alerts to be defined. Those alerts will likely be defined in Puppet.
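As a rough illustration of what "just" defining an alert could look like for a metric that is already scraped, here is a minimal sketch of a Prometheus alerting rule on `up`. The group name, alert name, `for` duration, severity label and annotation text are all assumptions made for illustration, not values from TPA-RFC-33, and the rule file would presumably be shipped through Puppet.

```yaml
# Sketch only: names, durations and labels below are hypothetical.
groups:
  - name: priority_a_sketch
    rules:
      - alert: TargetDown
        # `up` is 0 when Prometheus failed to scrape the target.
        expr: up == 0
        # Wait before firing to ride out short restarts (assumed value).
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus target is down"
```

A file like this can be syntax-checked with `promtool check rules <file>` before it is deployed.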
Checklist of Icinga alerts to cover, priority A (along with TPA-RFC-33's guess at possible Prometheus equivalent metrics or exporters):
- `check_disk` (`node_filesystem_avail_bytes`)
- `check_nrpe` (`up`)
- `dsa-check-drbd` (`node_drbd_out_of_sync_bytes`, `node_drbd_connected`, note: DRBD 9 not supported, alternatives: ha_cluster_exporter, drbd-reactor)
- `dsa-check-raid-sw` (`node_md_disks` / `node_md_state`, note: see also this post)
- "apt - security updates" (alerts for `apt_upgrades_*` metrics from `apt_info.py`, missing alerts delegated to phase B, see #41712, @anarcat, see also #41355 (closed) for the timestamp metric that we might want here)
- `needrestart -p` (`kernel_status`, `microcode_status`, note: needs development, not supported upstream, alternative implementation lacking) (@lelutin)
- `check_ssh --timeout=40` (`probe_success` in blackbox, possibly covered by #41632 (closed))
- `check_smtp` (ports 25, 587 and 465, depending on the server, `probe_success`, possibly covered by #41632 (closed), also need end-to-end deliverability checks, to be done in a further phase, @lelutin)
- `check_http` (`probe_success`, `probe_duration_seconds`, critical only for key sites, after significant delay, see also #40568 (closed))
- `check_https` (idem)
- `dsa-check-cert-expire` (cert-exporter, checks local CA for expiry, on disk, `/etc/ssl/certs/thishost.pem` and `db.torproject.org.pem` on each host), postponed to #41732 (closed)
- `dsa_check_cert` (cert-exporter, check for cert expiry for all sites, the above will check for real user-visible failures, this is about "pending renewal failed", nagios checks for 14 days, see also #41385, possibly `probe_ssl_earliest_cert_expiry` easier to deploy for "all sites", e.g. `(probe_ssl_earliest_cert_expiry - time()) < 30*24*60*60`)
- "redis liveness" (blackbox, checks that the Redis tunnel works, might require blackbox exporter, possibly better served by end-to-end donation testing?) (@lelutin)
- `dsa-check-backuppg` (barman-exporter, tricky dependency on barman rebuild, not builtin? This utterly failed, see #40950, will need to port existing checks, see #41774 (closed)) (@anarcat)
- `dsa-check-bacula` (bacula-exporter, see also WMF's check_bacula.py) (@lelutin)
- mininag needs to be patched because it likely talks to NRPE, to verify (@anarcat), split out into #41734 (closed)
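To make the mapping from Icinga checks to Prometheus metrics more concrete, here is a hedged sketch of alerting rules for three of the items above: `check_disk`, the blackbox probes, and certificate expiry. The metric names and the expiry expression come from the checklist; the group and alert names, the 10% free-space threshold, the `fstype` filter, the `for` durations and the severity labels are assumptions and would need tuning against the existing Icinga thresholds.

```yaml
# Sketch only: thresholds, durations and names are hypothetical.
groups:
  - name: priority_a_checklist_sketch
    rules:
      # check_disk equivalent: less than 10% of the filesystem available.
      - alert: DiskSpaceLow
        expr: |
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
        for: 15m
        labels:
          severity: warning

      # check_ssh / check_smtp / check_http(s) equivalent: blackbox probe failed.
      - alert: ProbeFailed
        expr: probe_success == 0
        for: 10m
        labels:
          severity: warning

      # dsa_check_cert equivalent, using the expression quoted in the checklist.
      - alert: CertExpiresSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) < 30*24*60*60
        for: 1h
        labels:
          severity: warning
```

Note that the 30-day window here is wider than the 14 days the Nagios check uses; which one to keep is a policy decision this sketch does not settle.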
The "redis liveness" check is particularly tricky to implement; here is the magic configuration right now:
    - name: "redis liveness"
      nrpe: "if echo PING | nc -w 1 localhost 6379 | grep -m 1 -q +PONG; then echo 'OK: redis seems to be alive.'; else echo 'CRITICAL: Did not get a PONG from redis.'; exit 2; fi"
      hosts: crm-int-01
    - name: "redis liveness on crm-int-01 from crm-ext-01"
      nrpe: "if echo PING | nc -w 1 crm-int-01-priv 6379 | grep -m 1 -q +PONG; then echo 'OK: redis seems to be alive.'; else echo 'CRITICAL: Did not get a PONG from redis.'; exit 2; fi"
      hosts: crm-ext-01
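One possible translation of the PING / +PONG exchange above is a blackbox exporter module using the `tcp` prober with a `query_response` script. This is only a sketch: the module name `redis_ping` and the timeout are hypothetical. It also assumes the blackbox exporter doing the probing can actually reach port 6379 on `crm-int-01-priv`, which may not hold for a central exporter, since the second check deliberately runs from crm-ext-01 to test the tunnel.

```yaml
# Sketch only: module name and timeout are hypothetical.
modules:
  redis_ping:
    prober: tcp
    timeout: 5s
    tcp:
      query_response:
        # Send an inline PING and expect the +PONG reply, as the nc check does.
        - send: "PING"
          expect: '^\+PONG'
```

The resulting `probe_success` metric would then be what an alert is defined on, as for the other blackbox checks.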
Check a box only when (a) we are sure the metrics are being scraped and are up to date in Prometheus and (b) we have alerts that fire when an error condition occurs (consider triggering such error conditions to verify).
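Besides triggering error conditions on real hosts, the alert logic itself can be exercised offline with a `promtool` rule unit test. The sketch below is written against a hypothetical rule file `alerts.yml` containing an alert named `TargetDown` with `expr: up == 0`, `for: 5m`, a `severity: warning` label and a `summary` annotation. The file name, alert name, labels, annotation and host name are all assumptions for illustration.

```yaml
# Sketch only: assumes a hypothetical alerts.yml defining TargetDown
# (expr: up == 0, for: 5m, severity: warning, summary annotation).
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # Simulated scrape results: up for two minutes, then down.
    input_series:
      - series: 'up{job="node", instance="web-01.example.org:9100"}'
        values: '1 1 0 0 0 0 0 0 0 0 0 0'
    alert_rule_test:
      # By 10m the target has been down well past the 5m `for` window.
      - eval_time: 10m
        alertname: TargetDown
        exp_alerts:
          - exp_labels:
              severity: warning
              job: node
              instance: web-01.example.org:9100
            exp_annotations:
              summary: "Prometheus target is down"
```

Such a file would be run with `promtool test rules <test-file>`. It only covers part (b) of the rule above: it shows the alert fires on the simulated data, not that the real metric is being scraped.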
A more detailed table is in https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-33-monitoring#priority-a-1