replace remaining tor-nagios-checks diagnostic tools with fabric
To quote TPA-RFC-33:
Pager playbook responses
One key difference between Nagios-style checks and Prometheus alerting is that Nagios check results are actually text strings with lots of meaning embedded into them. Checks for `needrestart`, for example, might include the processes that need a kick, or `dsa-check-packages` will list which packages need an upgrade.
Prometheus doesn't give us anything like this: we can have counts and labels, so we could know, for example, how many packages are "obsolete" or "pending upgrade" but not which.
So we'll need a mechanism to allow operators to easily extract that information. We believe this might be implemented using a Fabric script that replicates parts of what the NRPE checks currently do, which would also have the added benefit of more easily running those scripts in batch on multiple hosts.
Alerts should also include references to the "Pager playbook" sections of the service documentation, as much as possible, so that tired operators that deal with an emergency can follow a quick guide directly instead of having to search documentation.
So the task here is to look at what we're actually using in `tor-nagios-checks` after the Icinga retirement, and replace those with fabric tasks.
Requires fabric to be deployed more broadly, see also #41484.
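
As a rough illustration of the Fabric approach suggested in the RFC quote, here is a minimal sketch that asks `needrestart` which services need a restart on a list of hosts. It assumes Fabric 2 or later, that `needrestart` is installed on the targets, and that its batch mode (`-b`) prints `NEEDRESTART-SVC:` lines. The function names are made up for this example and are not existing tasks.

```python
"""Sketch: list the services needrestart flags, across several hosts."""


def parse_needrestart_batch(output):
    """Return the service names found in `needrestart -b` batch output."""
    prefix = "NEEDRESTART-SVC:"
    return [
        line[len(prefix):].strip()
        for line in output.splitlines()
        if line.startswith(prefix)
    ]


def services_needing_restart(hosts):
    """Map each host to the services needrestart reports as needing a restart."""
    # Imported here so the parser above stays usable without Fabric installed.
    from fabric import Connection

    results = {}
    for host in hosts:
        with Connection(host) as conn:
            # -b: batch (machine-readable) output; -r l: list only, never restart.
            # Assumption: the connecting user is allowed to run needrestart as is.
            # warn=True: a non-zero exit status does not raise an exception.
            result = conn.run("needrestart -b -r l", hide=True, warn=True)
        results[host] = parse_needrestart_batch(result.stdout)
    return results
```

Calling `services_needing_restart(["host1.example.org", "host2.example.org"])` (placeholder hostnames) returns a dict keyed by host, which gives back the "which processes" detail that the Prometheus counts lose. The serial loop keeps the sketch simple; `fabric.ThreadingGroup` would be the natural next step for running on many hosts in parallel.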
Update: there are more checks with side effects than I expected. In particular, the PostgreSQL backup checks remove old backups! So here's a checklist of things to do before we can actually retire the `tor-nagios-checks` package.
Priority A:
- adapt `dsa-check-backuppg` to output Prometheus metrics, as a stopgap measure
- grep for `/usr/lib/nagios/plugins` or `dsa[-_]` in all cron jobs and in all our source code
- audit all active checks for side effects
- remove `tor-nagios-checks` from all servers apart from exceptions, which are `bungei` and `nevii`
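
The grep step above could be scripted along these lines. This is only a sketch: the two patterns come from the checklist, but the function name and the choice of directories to scan are assumptions, not an existing tool.

```python
import re
from pathlib import Path

# The two patterns named in the checklist above.
PATTERN = re.compile(r"/usr/lib/nagios/plugins|dsa[-_]")


def find_references(roots):
    """Yield (path, line number, line) for every matching line under roots."""
    for root in roots:
        root = Path(root)
        paths = [root] if root.is_file() else sorted(root.rglob("*"))
        for path in paths:
            if not path.is_file():
                continue
            try:
                # errors="replace" keeps binary files from aborting the scan.
                text = path.read_text(errors="replace")
            except OSError:
                continue  # unreadable file: skip it rather than stop the audit
            for number, line in enumerate(text.splitlines(), start=1):
                if PATTERN.search(line):
                    yield path, number, line.strip()
```

Something like `for path, number, line in find_references(["/etc/cron.d"]): print(f"{path}:{number}: {line}")` would print the hits. Note that `dsa[-_]` also matches strings such as `ecdsa-sha2-nistp256` in SSH keys, so the output needs a manual pass to weed out false positives.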
Priority B:
- port `dsa-check-backuppg` to prometheus or replace with alternative (#40950), note that there's already a cron job for expiration, separate from the NRPE check, in `/etc/cron.d/tor-backup-postgres`, deployed by puppet
- DNS checks, probably covered by #41794
  - `dsa_check_soas_add`: "checks that zones are in sync on secondaries", to be analyzed
  - `dsa-check-zone-rrsig-expiration-many`: [dnssec-exporter][]? drop DNSSEC? to be analyzed
  - `dsa-check-zone-signature-all`: idem
  - `dsa-check-dnssec-delegation`: idem
  - "DNS - key coverage": idem, `dsa-check-statusfile /srv/dns.torproject.org/var/nagios/coverage` on nevii, could be converted as is
  - "DNS - DS expiry": idem, `dsa-check-statusfile /srv/dns.torproject.org/var/nagios/ds` on nevii
- followup in #41639
  - `check_ntp_time`: unclear how that differs from `check_ntp_peer`
- retire `tor-nagios-checks` from remaining servers
- remove `tor-nagios-checks` from archive
- archive `tor-nagios` repository (already done in #40695)
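
For the `dsa-check-backuppg` items, one possible shape for emitting Prometheus metrics is a small script that writes a `.prom` file for the node exporter's textfile collector. The sketch below is read-only on the backup side, so it deliberately does not reproduce the side effect of removing old backups. The metric name, the `cluster` label and the idea of one directory per cluster are all assumptions for illustration, not the actual backup layout.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical metric name, chosen for this example only.
METRIC = "tpa_backuppg_newest_file_timestamp_seconds"


def newest_mtime(directory):
    """Return the mtime of the most recently modified file, or None if there is none."""
    mtimes = [p.stat().st_mtime for p in Path(directory).rglob("*") if p.is_file()]
    return max(mtimes, default=None)


def render_metrics(timestamps):
    """Render {cluster: unix timestamp} in the Prometheus text exposition format."""
    lines = [
        f"# HELP {METRIC} Modification time of the newest backup file.",
        f"# TYPE {METRIC} gauge",
    ]
    # Assumption: cluster names contain no quotes or backslashes (no escaping done).
    for name, timestamp in sorted(timestamps.items()):
        lines.append(f'{METRIC}{{cluster="{name}"}} {timestamp:.0f}')
    return "\n".join(lines) + "\n"


def write_atomically(path, content):
    """Write to a temporary file, then rename, so a partial file is never read."""
    path = Path(path)
    # The .tmp suffix keeps the textfile collector, which reads *.prom, from picking it up.
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.chmod(tmp, 0o644)
    os.replace(tmp, path)
```

Exporting the timestamp rather than an age lets the alert rule compute staleness itself with `time() - <metric>`, so the value stays meaningful even if the script stops running.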