retire tor-nagios-checks package
To quote [TPA-RFC-33](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-33-monitoring#pager-playbook-responses):

> ### Pager playbook responses
>
> One key difference between Nagios-style checks and Prometheus alerting is that Nagios check results are actually text strings with lots of meaning embedded into them. Checks for `needrestart`, for example, might include the processes that need a kick, or `dsa-check-packages` will list which packages need an upgrade.
>
> Prometheus doesn't give us anything like this: we can have counts and labels, so we could know, for example, how many packages are "obsolete" or "pending upgrade" but not _which_.
>
> So we'll need a mechanism to allow operators to easily extract that information. We believe this might be implemented using a Fabric script that replicates parts of what the NRPE checks currently do, which would also have the added benefit of more easily running those scripts in batch on multiple hosts.
>
> Alerts should also include references to the "Pager playbook" sections of the service documentation, as much as possible, so that tired operators that deal with an emergency can follow a quick guide directly instead of having to search documentation.

So the task here is to look at what we're actually using in `tor-nagios-checks` after the Icinga retirement, and replace those with fabric tasks.

Requires fabric to be deployed more broadly, see also #41484.

Update: there are more checks with side effects than I expected. In particular the postgresql backup checks remove old backups!

So here's a check list of things to check before we can actually retire the `tor-nagios-checks` package.
Priority A:

- [x] adapt `dsa-check-backuppg` to output Prometheus metrics, as a stopgap measure
- [x] grep for `/usr/lib/nagios/plugins` or `dsa[-_]` in all cron jobs and in all our source code
- [x] audit *all* active checks for side effects
- [x] remove `tor-nagios-checks` from all servers apart from exceptions, which are `bungei` and `nevii`

Priority B:

- [x] ~~port `dsa-check-backuppg` to prometheus or~~ replace with alternative (#40950), note that there's already a cron job for expiration, separate from the NRPE check, in `/etc/cron.d/tor-backup-postgres`, deployed by puppet, to be removed
- [x] DNS checks, covered by https://gitlab.torproject.org/tpo/tpa/team/-/issues/41794 and https://gitlab.torproject.org/tpo/tpa/team/-/issues/42268
  - [ ] `dsa_check_soas_add`: "checks that zones are in sync on secondaries", to be analyzed
  - [ ] `dsa-check-zone-rrsig-expiration-many`: [dnssec-exporter][]? drop DNSSEC? to be analyzed
  - [ ] `dsa-check-zone-signature-all`: idem
  - [ ] `dsa-check-dnssec-delegation`: idem
  - [ ] "DNS - key coverage": idem, `dsa-check-statusfile /srv/dns.torproject.org/var/nagios/coverage` on nevii, could be converted as is
  - [ ] "DNS - DS expiry": idem, `dsa-check-statusfile /srv/dns.torproject.org/var/nagios/ds` on nevii
- [x] ~~`check_ntp_time`: unclear how that differs from `check_ntp_peer`~~ followup in #41639
- [x] retire `tor-nagios-checks` from remaining servers
- [x] remove `/usr/lib/nagios/plugins` from `PATH`, in `tor-puppet/legacy/torproject_org/manifests/init.pp`, line 68: `PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/nagios/plugins`
- [x] archive `tor-nagios` repository (already done in #40695)
- [x] mark repository as `deleted` in mr
- [x] remove `tor-nagios-checks` from archive
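
For reference, the "grep for `/usr/lib/nagios/plugins` or `dsa[-_]`" item in Priority A could be sketched roughly as below. This is only an illustration, not the procedure that was actually used: the function name `find_references` is hypothetical, and the directories to scan are left to the caller since the exact list of cron and source locations is not spelled out in this issue.

```python
import re
from pathlib import Path

# Pattern taken from the checklist item: the Nagios plugin directory,
# or any name starting with "dsa-" or "dsa_".
PATTERN = re.compile(r"/usr/lib/nagios/plugins|dsa[-_]")


def find_references(roots):
    """Yield (path, line number, stripped line) for each line matching PATTERN.

    `roots` is an iterable of files or directories; directories are walked
    recursively. Missing or unreadable paths are skipped silently.
    """
    for root in roots:
        root = Path(root)
        if not root.exists():
            continue
        if root.is_file():
            files = [root]
        else:
            files = sorted(p for p in root.rglob("*") if p.is_file())
        for path in files:
            try:
                # errors="replace" keeps binary or oddly encoded files from aborting the scan
                text = path.read_text(errors="replace")
            except OSError:
                continue
            for lineno, line in enumerate(text.splitlines(), start=1):
                if PATTERN.search(line):
                    yield path, lineno, line.strip()
```

Something like `for path, lineno, line in find_references(["/etc/cron.d"]): print(f"{path}:{lineno}: {line}")` would then list candidate cron entries to review by hand. The `dsa[-_]` pattern is deliberately broad, so some false positives are to be expected.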