
retire tor-nagios-checks package

To quote TPA-RFC-33:

Pager playbook responses

One key difference between Nagios-style checks and Prometheus alerting is that Nagios check results are actually text strings with lots of meaning embedded into them. Checks for needrestart, for example, might include the processes that need a kick, or dsa-check-packages will list which packages need an upgrade.

Prometheus doesn't give us anything like this: we can have counts and labels, so we could know, for example, how many packages are "obsolete" or "pending upgrade" but not which.
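To make that limitation concrete, here is a minimal sketch of how package states might be rendered in the Prometheus text exposition format. The metric name `apt_packages` and the `state` label values are hypothetical, chosen only for illustration; the point is that only the per-state count survives, not the package names.

```python
from collections import Counter


def render_package_metrics(packages, metric="apt_packages"):
    """Render per-state package counts in the Prometheus text format.

    `packages` maps a package name to a state such as "obsolete" or
    "pending_upgrade". The metric name and state values are hypothetical.
    """
    counts = Counter(packages.values())
    lines = [f"# TYPE {metric} gauge"]
    for state in sorted(counts):
        # Only the count is exported; the package names are dropped here.
        lines.append(f'{metric}{{state="{state}"}} {counts[state]}')
    return "\n".join(lines) + "\n"
```

Calling `render_package_metrics({"libfoo1": "obsolete", "bar": "pending_upgrade"})` yields one sample per state, which is enough to alert on but not enough to tell an operator which packages to act on.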

So we'll need a mechanism to allow operators to easily extract that information. We believe this might be implemented using a Fabric script that replicates parts of what the NRPE checks currently do, which would also have the added benefit of more easily running those scripts in batch on multiple hosts.
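As a rough illustration of such a mechanism, the sketch below lists the packages pending an upgrade by parsing the output of `apt-get --simulate upgrade`. It is not the actual Fabric script: the function names are hypothetical, and the command runner is passed in as a plain callable so the same parsing could be reused locally or, as an assumption, wrapped around a Fabric connection.

```python
import subprocess


def parse_pending_upgrades(simulated_output):
    """Return sorted package names from `apt-get --simulate upgrade` output."""
    names = []
    for line in simulated_output.splitlines():
        fields = line.split()
        # Simulated installs look like: "Inst <package> [<old>] (<new> ...)"
        if len(fields) >= 2 and fields[0] == "Inst":
            names.append(fields[1])
    return sorted(names)


def run_local(command):
    """Run a command on the local host and return its standard output."""
    result = subprocess.run(
        command.split(), capture_output=True, text=True, check=True
    )
    return result.stdout


def pending_upgrades(run=run_local):
    """List pending upgrades, using `run` to execute the command.

    `run` is any callable taking a command string and returning stdout.
    """
    return parse_pending_upgrades(run("apt-get --simulate upgrade"))
```

With Fabric, the runner could plausibly be something like `lambda cmd: connection.run(cmd, hide=True).stdout`, called once per host to get the batch behaviour described above; that wiring is left out here since it depends on how Fabric ends up being deployed.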

Alerts should also include references to the "Pager playbook" sections of the service documentation, as much as possible, so that tired operators that deal with an emergency can follow a quick guide directly instead of having to search documentation.

So the task here is to look at what we're actually still using in tor-nagios-checks after the Icinga retirement, and replace those checks with Fabric tasks.

This requires Fabric to be deployed more broadly, see also #41484 (closed).

Update: there are more checks with side effects than I expected. In particular, the PostgreSQL backup checks remove old backups! So here's a checklist of things to do before we can actually retire the tor-nagios-checks package.

Priority A:

  • adapt dsa-check-backuppg to output Prometheus metrics, as a stopgap measure
  • grep for /usr/lib/nagios/plugins or dsa[-_] in all cron jobs and in all our source code
  • audit all active checks for side effects
  • remove tor-nagios-checks from all servers except bungei and nevii
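
The grep item above could be scripted along these lines. This is only a sketch: the regular expression combines the two patterns named in the checklist, while the function name and the search roots are hypothetical and would need to be adjusted to where the cron jobs and source checkouts actually live.

```python
import re
from pathlib import Path

# The two patterns from the checklist: the plugin directory and dsa- / dsa_ names.
PATTERN = re.compile(r"/usr/lib/nagios/plugins|dsa[-_]")


def find_references(roots):
    """Yield (path, line_number, line) for each matching line under `roots`."""
    for root in roots:
        root = Path(root)
        if not root.exists():
            continue  # skip search roots that are absent on this host
        if root.is_file():
            files = [root]
        else:
            files = sorted(p for p in root.rglob("*") if p.is_file())
        for path in files:
            try:
                text = path.read_text(errors="replace")
            except OSError:
                continue  # unreadable file, skip it
            for number, line in enumerate(text.splitlines(), start=1):
                if PATTERN.search(line):
                    yield path, number, line.strip()
```

For example, `for path, number, line in find_references(["/etc/cron.d", "/etc/crontab"]): print(f"{path}:{number}: {line}")` prints grep-style hits for the system cron jobs; the same call can be pointed at a Puppet checkout to cover the source code.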

Priority B:

  • port dsa-check-backuppg to Prometheus or replace it with an alternative (#40950 (closed)). Note that there's already a cron job for expiration, separate from the NRPE check, in /etc/cron.d/tor-backup-postgres, deployed by Puppet, to be removed

  • DNS checks, covered by #41794 and #42268

    • dsa_check_soas_add: "checks that zones are in sync on secondaries", to be analyzed
    • dsa-check-zone-rrsig-expiration-many: dnssec-exporter? drop DNSSEC? to be analyzed
    • dsa-check-zone-signature-all: idem
    • dsa-check-dnssec-delegation: idem
    • "DNS - key coverage": idem, dsa-check-statusfile /srv/dns.torproject.org/var/nagios/coverage on nevii, could be converted as is
    • "DNS - DS expiry": idem, dsa-check-statusfile /srv/dns.torproject.org/var/nagios/ds on nevii
  • check_ntp_time: unclear how that differs from check_ntp_peer, followup in #41639 (closed)

  • retire tor-nagios-checks from remaining servers

  • remove tor-nagios-checks from archive

  • remove /usr/lib/nagios/plugins from PATH, in tor-puppet/legacy/torproject_org/manifests/init.pp:

    68:      PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/nagios/plugins
  • archive tor-nagios repository (already done in #40695 (closed))

  • mark repository as deleted in mr

Edited by anarcat