retire tor-nagios-checks package
To quote [TPA-RFC-33](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-33-monitoring#pager-playbook-responses):
> ### Pager playbook responses
>
> One key difference between Nagios-style checks and Prometheus alerting is that Nagios check results are actually text strings with lots of meaning embedded into them. Checks for `needrestart`, for example, might include the processes that need a kick, or `dsa-check-packages` will list which packages need an upgrade.
>
> Prometheus doesn't give us anything like this: we can have counts and labels, so we could know, for example, how many packages are "obsolete" or "pending upgrade" but not _which_.
>
> So we'll need a mechanism to allow operators to easily extract that information. We believe this might be implemented using a Fabric script that replicates parts of what the NRPE checks currently do, which would also have the added benefit of more easily running those scripts in batch on multiple hosts.
>
> Alerts should also include references to the "Pager playbook" sections of the service documentation, as much as possible, so that tired operators that deal with an emergency can follow a quick guide directly instead of having to search documentation.
So the task here is to look at what we're actually using in `tor-nagios-checks` after the Icinga retirement, and replace those checks with Fabric tasks.
This requires Fabric to be deployed more broadly; see also #41484.
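To make the "which packages, not just how many" point from the RFC concrete, here is a minimal sketch of what such a Fabric helper could look like. It is not an existing TPA task: the names `Upgrade`, `parse_upgradable` and `pending_upgrades` are made up for illustration, and it assumes the usual `apt list --upgradable` output format rather than replicating what `dsa-check-packages` does.

```python
import re
from typing import NamedTuple


class Upgrade(NamedTuple):
    """One package with a pending upgrade (hypothetical helper type)."""
    package: str
    candidate: str
    installed: str


# Matches lines such as:
# openssl/stable-security 3.0.11-1~deb12u2 amd64 [upgradable from: 3.0.11-1~deb12u1]
_LINE = re.compile(
    r"^(?P<package>[^/\s]+)/\S+\s+(?P<candidate>\S+)\s+\S+\s+"
    r"\[upgradable from: (?P<installed>[^\]]+)\]$"
)


def parse_upgradable(output: str) -> list[Upgrade]:
    """Extract package names and versions from `apt list --upgradable` output.

    Lines that do not match (such as the "Listing..." header) are ignored.
    """
    upgrades = []
    for line in output.splitlines():
        match = _LINE.match(line.strip())
        if match:
            upgrades.append(
                Upgrade(match["package"], match["candidate"], match["installed"])
            )
    return upgrades


def pending_upgrades(host: str) -> list[Upgrade]:
    """Run the listing on a remote host over SSH and parse the result.

    Fabric is imported lazily so the parser above stays usable without it.
    """
    from fabric import Connection

    result = Connection(host).run("apt list --upgradable", hide=True, warn=True)
    return parse_upgradable(result.stdout)
```

Keeping the parsing separate from the SSH call means the same function can be looped over a list of hosts, which is the "batch on multiple hosts" benefit mentioned in the RFC.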
Update: there are more checks with side effects than I expected. In particular, the PostgreSQL backup checks remove old backups! So here's a checklist of things to verify before we can actually retire the `tor-nagios-checks` package.
Priority A:
- [x] adapt `dsa-check-backuppg` to output Prometheus metrics, as a stopgap measure
- [x] grep for `/usr/lib/nagios/plugins` or `dsa[-_]` in all cron jobs and in all our source code
- [x] audit *all* active checks for side effects
- [x] remove `tor-nagios-checks` from all servers except `bungei` and `nevii`
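For reference, one generic way to expose a Nagios-style check result to Prometheus is through the node exporter's textfile collector. The sketch below illustrates that general pattern only and is not necessarily how the `dsa-check-backuppg` stopgap was done: the metric name `nagios_check_status` and both function names are assumptions, and the target directory would be whatever the textfile collector is configured to read.

```python
import os
import tempfile


def render_metrics(check: str, exit_code: int) -> str:
    """Render a check's exit code in the Prometheus text exposition format.

    The metric name is hypothetical; the value is the raw Nagios exit code
    (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN).
    """
    lines = [
        "# HELP nagios_check_status Exit code of a legacy Nagios-style check.",
        "# TYPE nagios_check_status gauge",
        f'nagios_check_status{{check="{check}"}} {exit_code}',
    ]
    return "\n".join(lines) + "\n"


def write_textfile(directory: str, check: str, exit_code: int) -> str:
    """Write `<check>.prom` into the textfile collector directory.

    The file is written under a temporary name first and then renamed, so
    the exporter never reads a half-written file.
    """
    final = os.path.join(directory, f"{check}.prom")
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(render_metrics(check, exit_code))
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; the exporter must read it
    os.replace(tmp, final)
    return final
```

This only carries the exit code. As the RFC excerpt points out, the text of the check output is lost, so anything an operator needs to read still has to come from somewhere else.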
Priority B:
- [x] ~~port `dsa-check-backuppg` to Prometheus or~~ replace with alternative (#40950). Note that there's already a cron job for expiration, separate from the NRPE check, in `/etc/cron.d/tor-backup-postgres` (deployed by Puppet, to be removed)
- [x] DNS checks, covered by https://gitlab.torproject.org/tpo/tpa/team/-/issues/41794 and https://gitlab.torproject.org/tpo/tpa/team/-/issues/42268
- [ ] `dsa_check_soas_add`: "checks that zones are in sync on secondaries", to be analyzed
- [ ] `dsa-check-zone-rrsig-expiration-many`: [dnssec-exporter][]? drop DNSSEC? to be analyzed
- [ ] `dsa-check-zone-signature-all`: idem
- [ ] `dsa-check-dnssec-delegation`: idem
- [ ] "DNS - key coverage": idem, `dsa-check-statusfile /srv/dns.torproject.org/var/nagios/coverage` on nevii, could be converted as is
- [ ] "DNS - DS expiry": idem, `dsa-check-statusfile /srv/dns.torproject.org/var/nagios/ds` on nevii
- [x] ~~`check_ntp_time`: unclear how that differs from `check_ntp_peer`~~ followup in #41639
- [x] retire `tor-nagios-checks` from remaining servers
- [x] remove `/usr/lib/nagios/plugins` from `PATH`, in `tor-puppet/legacy/torproject_org/manifests/init.pp`, line 68: `PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/nagios/plugins`
- [x] archive `tor-nagios` repository (already done in #40695)
- [x] mark repository as `deleted` in mr
- [x] remove `tor-nagios-checks` from archive