retire tor-nagios-checks package
To quote TPA-RFC-33:
Pager playbook responses
One key difference between Nagios-style checks and Prometheus alerting is that Nagios check results are actually text strings with lots of meaning embedded into them. Checks for needrestart, for example, might include the processes that need a kick, or dsa-check-packages will list which packages need an upgrade.
Prometheus doesn't give us anything like this: we can have counts and labels, so we could know, for example, how many packages are "obsolete" or "pending upgrade" but not which.
So we'll need a mechanism to allow operators to easily extract that information. We believe this might be implemented using a Fabric script that replicates parts of what the NRPE checks currently do, which would also have the added benefit of more easily running those scripts in batch on multiple hosts.
Alerts should also include references to the "Pager playbook" sections of the service documentation, as much as possible, so that tired operators that deal with an emergency can follow a quick guide directly instead of having to search documentation.
So the task here is to look at what we're actually using in tor-nagios-checks after the Icinga retirement, and replace those with fabric tasks.
Requires fabric to be deployed more broadly, see also #41484.
Update: there are more checks with side effects than I expected. In particular the postgresql backup checks remove old backups! So here's a check list of things to check before we can actually retire the tor-nagios-checks package.
Priority A:
- adapt dsa-check-backuppg to output Prometheus metrics, as a stopgap measure
- grep for /usr/lib/nagios/plugins or dsa[-_] in all cron jobs and in all our source code
- audit all active checks for side effects
- remove tor-nagios-checks from all servers apart from exceptions, which are bungei and nevii
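As an illustration of the stopgap in the first item above, here is a minimal sketch of how a check could emit its result in the Prometheus text exposition format, for pickup by something like the node exporter's textfile collector. The metric name `example_backuppg_age_seconds`, the `backup` label and the function names are hypothetical and are not taken from dsa-check-backuppg itself; label values are assumed not to need escaping.

```python
import os
import tempfile


def render_metrics(backup_ages: dict[str, float]) -> str:
    """Render per-backup ages (in seconds) in the Prometheus text format.

    The metric and label names are placeholders for illustration only.
    """
    lines = [
        "# HELP example_backuppg_age_seconds Age of the newest backup.",
        "# TYPE example_backuppg_age_seconds gauge",
    ]
    for name, age in sorted(backup_ages.items()):
        lines.append(f'example_backuppg_age_seconds{{backup="{name}"}} {age}')
    return "\n".join(lines) + "\n"


def write_textfile(path: str, content: str) -> None:
    """Write the metrics atomically so a reader never sees a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    # mkstemp creates the file with mode 0600; make it world-readable
    # so an unprivileged collector can read it.
    os.chmod(tmp, 0o644)
    os.replace(tmp, path)
```

Writing to a temporary file and renaming it keeps the collector from scraping a half-written file, which matters when the check runs from cron independently of the scrape interval.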
Priority B:
- port dsa-check-backuppg to prometheus or replace with alternative (#40950), note that there's already a cron job for expiration, separate from the NRPE check, in /etc/cron.d/tor-backup-postgres, deployed by puppet, to be removed
- DNS checks, covered by #41794
  - dsa_check_soas_add: "checks that zones are in sync on secondaries", to be analyzed
  - dsa-check-zone-rrsig-expiration-many: [dnssec-exporter][]? drop DNSSEC? to be analyzed
  - dsa-check-zone-signature-all: idem
  - dsa-check-dnssec-delegation: idem
  - "DNS - key coverage": idem, dsa-check-statusfile /srv/dns.torproject.org/var/nagios/coverage on nevii, could be converted as is
  - "DNS - DS expiry": idem, dsa-check-statusfile /srv/dns.torproject.org/var/nagios/ds on nevii
- check_ntp_time: unclear how that differs from check_ntp_peer, followup in #41639 (closed)
- retire tor-nagios-checks from remaining servers
- remove tor-nagios-checks from archive
- remove /usr/lib/nagios/plugins from PATH, in tor-puppet/legacy/torproject_org/manifests/init.pp:68: PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/nagios/plugins
- archive tor-nagios repository (already done in #40695 (closed))
- mark repository as deleted in mr
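The PATH item in the list above amounts to dropping one entry from a colon-separated string. A small sketch of that transformation follows, which could be used to compute or verify the expected value before editing the manifest; the function name is hypothetical, and the real change would be made in the Puppet manifest, not in Python.

```python
def drop_path_entry(path_value: str, entry: str) -> str:
    """Return path_value without any component equal to entry.

    Trailing slashes are ignored when comparing components.
    """
    wanted = entry.rstrip("/")
    kept = [part for part in path_value.split(":") if part.rstrip("/") != wanted]
    return ":".join(kept)


current = (
    "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:"
    "/usr/lib/nagios/plugins"
)
print(drop_path_entry(current, "/usr/lib/nagios/plugins"))
```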
Activity
- anarcat marked this issue as related to #40695 (closed)
- anarcat changed milestone to %TPA-RFC-33-B: Prometheus server merge, more exporters
- anarcat added Prometheus Roadmap::Future Technical Debt labels
here's the content of that package on perdulce:
root@perdulce:~# dpkg -L tor-nagios-checks
/.
/etc
/etc/cron.d
/etc/cron.d/tor-nagios-checks
/etc/nagios
/etc/nagios/check-libs.conf.sample
/etc/nagios/dsa-check-backuppg.conf.sample
/etc/nagios/nrpe.d
/etc/nagios/obsolete-packages-ignore
/etc/nagios/obsolete-packages-ignore.d
/usr
/usr/lib
/usr/lib/nagios
/usr/lib/nagios/plugins
/usr/lib/nagios/plugins/check_ganeti_cluster
/usr/lib/nagios/plugins/check_ganeti_disks
/usr/lib/nagios/plugins/check_ganeti_instances
/usr/lib/nagios/plugins/check_md3000i.pl
/usr/lib/nagios/plugins/check_puppet_agent
/usr/lib/nagios/plugins/check_puppetdb_nodes
/usr/lib/nagios/plugins/dsa-check-backuppg
/usr/lib/nagios/plugins/dsa-check-bacula
/usr/lib/nagios/plugins/dsa-check-cert-expire
/usr/lib/nagios/plugins/dsa-check-cert-expire-dir
/usr/lib/nagios/plugins/dsa-check-config
/usr/lib/nagios/plugins/dsa-check-dabackup
/usr/lib/nagios/plugins/dsa-check-dabackup-server
/usr/lib/nagios/plugins/dsa-check-dchroots-current
/usr/lib/nagios/plugins/dsa-check-dnssec-delegation
/usr/lib/nagios/plugins/dsa-check-drbd
/usr/lib/nagios/plugins/dsa-check-file_age
/usr/lib/nagios/plugins/dsa-check-filesystems
/usr/lib/nagios/plugins/dsa-check-hpacucli
/usr/lib/nagios/plugins/dsa-check-hpasm
/usr/lib/nagios/plugins/dsa-check-ipv6-default-gw
/usr/lib/nagios/plugins/dsa-check-libs
/usr/lib/nagios/plugins/dsa-check-mirrorsync
/usr/lib/nagios/plugins/dsa-check-msa-eventlog
/usr/lib/nagios/plugins/dsa-check-packages
/usr/lib/nagios/plugins/dsa-check-port-closed
/usr/lib/nagios/plugins/dsa-check-raid-3ware
/usr/lib/nagios/plugins/dsa-check-raid-aacraid
/usr/lib/nagios/plugins/dsa-check-raid-areca
/usr/lib/nagios/plugins/dsa-check-raid-dac960
/usr/lib/nagios/plugins/dsa-check-raid-megaraid
/usr/lib/nagios/plugins/dsa-check-raid-megaraid-sas
/usr/lib/nagios/plugins/dsa-check-raid-mpt
/usr/lib/nagios/plugins/dsa-check-raid-sw
/usr/lib/nagios/plugins/dsa-check-running-kernel
/usr/lib/nagios/plugins/dsa-check-samhain
/usr/lib/nagios/plugins/dsa-check-soas
/usr/lib/nagios/plugins/dsa-check-statusfile
/usr/lib/nagios/plugins/dsa-check-stunnel-sanity
/usr/lib/nagios/plugins/dsa-check-ucode-intel
/usr/lib/nagios/plugins/dsa-check-udldap-freshness
/usr/lib/nagios/plugins/dsa-check-unbound-anchors
/usr/lib/nagios/plugins/dsa-check-uptime
/usr/lib/nagios/plugins/dsa-check-zone-rrsig-expiration
/usr/lib/nagios/plugins/dsa-check-zone-rrsig-expiration-many
/usr/lib/nagios/plugins/dsa-check-zone-signature-all
/usr/lib/nagios/plugins/tor-check-collector
/usr/lib/nagios/plugins/tor-check-onionoo
/usr/sbin
/usr/sbin/dsa-update-apt-status
/usr/sbin/dsa-update-samhain-status
/usr/share
/usr/share/doc
/usr/share/doc/tor-nagios-checks
/usr/share/doc/tor-nagios-checks/README.Debian
/usr/share/doc/tor-nagios-checks/changelog.gz
/usr/share/doc/tor-nagios-checks/copyright
/usr/share/dsa
/usr/share/dsa/apt-status-check
/usr/share/dsa/weak-ssh-keys-check
/var
/var/cache
/var/cache/dsa
/var/cache/dsa/nagios
actual commands are:
root@perdulce:~# dpkg -L tor-nagios-checks | grep -e plugins/ -e bin/ | sort
/usr/lib/nagios/plugins/check_ganeti_cluster
/usr/lib/nagios/plugins/check_ganeti_disks
/usr/lib/nagios/plugins/check_ganeti_instances
/usr/lib/nagios/plugins/check_md3000i.pl
/usr/lib/nagios/plugins/check_puppet_agent
/usr/lib/nagios/plugins/check_puppetdb_nodes
/usr/lib/nagios/plugins/dsa-check-backuppg
/usr/lib/nagios/plugins/dsa-check-bacula
/usr/lib/nagios/plugins/dsa-check-cert-expire
/usr/lib/nagios/plugins/dsa-check-cert-expire-dir
/usr/lib/nagios/plugins/dsa-check-config
/usr/lib/nagios/plugins/dsa-check-dabackup
/usr/lib/nagios/plugins/dsa-check-dabackup-server
/usr/lib/nagios/plugins/dsa-check-dchroots-current
/usr/lib/nagios/plugins/dsa-check-dnssec-delegation
/usr/lib/nagios/plugins/dsa-check-drbd
/usr/lib/nagios/plugins/dsa-check-file_age
/usr/lib/nagios/plugins/dsa-check-filesystems
/usr/lib/nagios/plugins/dsa-check-hpacucli
/usr/lib/nagios/plugins/dsa-check-hpasm
/usr/lib/nagios/plugins/dsa-check-ipv6-default-gw
/usr/lib/nagios/plugins/dsa-check-libs
/usr/lib/nagios/plugins/dsa-check-mirrorsync
/usr/lib/nagios/plugins/dsa-check-msa-eventlog
/usr/lib/nagios/plugins/dsa-check-packages
/usr/lib/nagios/plugins/dsa-check-port-closed
/usr/lib/nagios/plugins/dsa-check-raid-3ware
/usr/lib/nagios/plugins/dsa-check-raid-aacraid
/usr/lib/nagios/plugins/dsa-check-raid-areca
/usr/lib/nagios/plugins/dsa-check-raid-dac960
/usr/lib/nagios/plugins/dsa-check-raid-megaraid
/usr/lib/nagios/plugins/dsa-check-raid-megaraid-sas
/usr/lib/nagios/plugins/dsa-check-raid-mpt
/usr/lib/nagios/plugins/dsa-check-raid-sw
/usr/lib/nagios/plugins/dsa-check-running-kernel
/usr/lib/nagios/plugins/dsa-check-samhain
/usr/lib/nagios/plugins/dsa-check-soas
/usr/lib/nagios/plugins/dsa-check-statusfile
/usr/lib/nagios/plugins/dsa-check-stunnel-sanity
/usr/lib/nagios/plugins/dsa-check-ucode-intel
/usr/lib/nagios/plugins/dsa-check-udldap-freshness
/usr/lib/nagios/plugins/dsa-check-unbound-anchors
/usr/lib/nagios/plugins/dsa-check-uptime
/usr/lib/nagios/plugins/dsa-check-zone-rrsig-expiration
/usr/lib/nagios/plugins/dsa-check-zone-rrsig-expiration-many
/usr/lib/nagios/plugins/dsa-check-zone-signature-all
/usr/lib/nagios/plugins/tor-check-collector
/usr/lib/nagios/plugins/tor-check-onionoo
/usr/sbin/dsa-update-apt-status
/usr/sbin/dsa-update-samhain-status
out of those, i can really think of only one check that we would need, actually:
root@perdulce:~# /usr/lib/nagios/plugins/dsa-check-packages
OK: 622 ok, 7 rc
622 packages current.
7 packages removed but not purged: linux-image-6.1.0-13-amd64, linux-image-6.1.0-16-amd64, linux-image-6.1.0-20-amd64, linux-image-6.1.0-17-amd64, linux-image-6.1.0-15-amd64, linux-image-6.1.0-18-amd64, linux-image-6.1.0-12-amd64 |obs_loc=0;1;5;0 outdated=0;1;5;0 current=622;;;0 obs_ign=0;;;0 rm_unprg=7;;;0 hold=0;;;0 prg_conf=0;1;;0
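The output above shows the split the description talks about: the text before the `|` names the packages, while the performance data after it only carries counts, which is roughly what Prometheus can hold. A minimal sketch of pulling those counts out follows, assuming the usual Nagios `label=value;warn;crit;min` layout with unit-less values on a single line; the function name is hypothetical.

```python
def parse_perfdata(plugin_output: str) -> dict[str, float]:
    """Return {label: value} from the perfdata after the first '|'.

    Thresholds (warn;crit;min) are ignored, and items whose value is not
    a plain number (for example values carrying a unit) are skipped.
    """
    _, separator, perfdata = plugin_output.partition("|")
    if not separator:
        return {}
    values: dict[str, float] = {}
    for item in perfdata.split():
        label, equals, rest = item.partition("=")
        if not equals:
            continue
        raw_value = rest.split(";", 1)[0]
        try:
            values[label] = float(raw_value)
        except ValueError:
            continue
    return values


sample = (
    "OK: 622 ok, 7 rc "
    "|obs_loc=0;1;5;0 outdated=0;1;5;0 current=622;;;0 obs_ign=0;;;0 "
    "rm_unprg=7;;;0 hold=0;;;0 prg_conf=0;1;;0"
)
print(parse_perfdata(sample))
```

The package names themselves are lost in this conversion, which is why a separate way of listing them on the host is still needed.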
Obviously, all the DNSSEC stuff is an issue as well, but it's an open question for me whether we want to keep DNSSEC at all. In any case, those should also be investigated, and in fact, maybe this issue could be where we investigate all those "priority E" checks:
https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-33-monitoring#priority-e-to-review
... which are:
| Check | Exporter | Rule level | Note |
|-------|----------|------------|------|
| dsa_check_soas_add | ??? | warning | checks that zones are in sync on secondaries |
| dsa-check-zone-rrsig-expiration-many | [dnssec-exporter][] | warning | TODO, drop DNSSEC? |
| dsa-check-zone-signature-all | ??? | warning | idem |
| dsa-check-dnssec-delegation | ??? | warning | idem |
| "DNS - key coverage" | ??? | warning | idem, dsa-check-statusfile /srv/dns.torproject.org/var/nagios/coverage on nevii, could be converted as is |
| "DNS - DS expiry" | ??? | warning | idem, dsa-check-statusfile /srv/dns.torproject.org/var/nagios/ds on nevii |
| check_ntp_time | node | warning | unclear how that differs from check_ntp_peer |
- anarcat mentioned in issue #40695 (closed)
- anarcat added time estimate of 40h
- anarcat mentioned in commit wiki-replica@7eddf801
i started working on this, to scratch an itch that was "wait, why do we have 10 pending alerts here". turns out those are totally fine and we shouldn't worry about them. example output:
$ fab -H pauli.torproject.org host.all-pending-upgrades --query='ALERTS{alertname="PackagesPendingTooLong"}'
INFO: found 10 hosts with pending upgrades: dal-node-01.torproject.org dal-node-02.torproject.org dal-node-03.torproject.org fsn-node-02.torproject.org fsn-node-03.torproject.org fsn-node-04.torproject.org fsn-node-05.torproject.org fsn-node-06.torproject.org fsn-node-08.torproject.org pauli.torproject.org
INFO: loading package lists from hosts
WARNING: found 1 pending upgrades out of 701 packages on host dal-node-02.torproject.org: needrestart (3.5-4+deb11u3, 3.7-2~tpo1)
WARNING: found 1 pending upgrades out of 708 packages on host dal-node-03.torproject.org: needrestart (3.5-4+deb11u3, 3.7-2~tpo1)
WARNING: found 1 pending upgrades out of 708 packages on host dal-node-01.torproject.org: needrestart (3.5-4+deb11u3, 3.7-2~tpo1)
WARNING: found 1 pending upgrades out of 615 packages on host pauli.torproject.org: needrestart (3.5-4+deb11u3, 3.7-2~tpo1)
WARNING: found 1 pending upgrades out of 721 packages on host fsn-node-03.torproject.org: needrestart (3.5-4+deb11u3, 3.7-2~tpo1)
WARNING: found 1 pending upgrades out of 743 packages on host fsn-node-02.torproject.org: needrestart (3.5-4+deb11u3, 3.7-2~tpo1)
WARNING: found 1 pending upgrades out of 698 packages on host fsn-node-06.torproject.org: needrestart (3.5-4+deb11u3, 3.7-2~tpo1)
WARNING: found 1 pending upgrades out of 716 packages on host fsn-node-04.torproject.org: needrestart (3.5-4+deb11u3, 3.7-2~tpo1)
WARNING: found 1 pending upgrades out of 692 packages on host fsn-node-08.torproject.org: needrestart (3.5-4+deb11u3, 3.7-2~tpo1)
WARNING: found 1 pending upgrades out of 704 packages on host fsn-node-05.torproject.org: needrestart (3.5-4+deb11u3, 3.7-2~tpo1)
INFO: outdated packages across infrastructure: needrestart (3.5-4+deb11u3, 3.7-2~tpo1)
cross-referenced from the runbook linked from the alert.
i think we might even be done here, amazingly, as that was basically the only thing missing from our list. we can extend the check once we have more alerts, of course, so i'm keeping this open to cover for that.
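The last INFO line of that output is an aggregation of the per-host results. A rough sketch of that step is below, assuming the per-host package lists have already been collected into a dictionary; the data shape and the function name are hypothetical and are not taken from the actual fabric-tasks code.

```python
from collections import defaultdict

# Hypothetical shape: {host: {package: (installed_version, candidate_version)}}
PendingByHost = dict[str, dict[str, tuple[str, str]]]


def summarize_pending(pending_by_host: PendingByHost) -> dict:
    """Group hosts by (package, versions) to get an infrastructure-wide view."""
    hosts_by_package = defaultdict(set)
    for host, packages in pending_by_host.items():
        for name, versions in packages.items():
            hosts_by_package[(name, versions)].add(host)
    # Sort both the keys and the host lists so the report is stable.
    return {key: sorted(hosts) for key, hosts in sorted(hosts_by_package.items())}


example = {
    "pauli.torproject.org": {"needrestart": ("3.5-4+deb11u3", "3.7-2~tpo1")},
    "dal-node-01.torproject.org": {"needrestart": ("3.5-4+deb11u3", "3.7-2~tpo1")},
}
for (name, versions), hosts in summarize_pending(example).items():
    print(f"{name} {versions}: {len(hosts)} hosts")
```

Grouping by package and version pair makes it obvious when many alerts share a single cause, as with the ten needrestart alerts above.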
- anarcat mentioned in commit fabric-tasks@6d0dd5ba
- anarcat marked this issue as related to #41770 (closed)
- anarcat mentioned in issue #41770 (closed)
- anarcat changed the description
- anarcat marked this issue as related to #41774 (closed)
- anarcat mentioned in issue #41774 (closed)
- anarcat changed the description
- anarcat changed the description
- anarcat added Next label and removed Roadmap::Future label
- marked the checklist item port dsa-check-backuppg to prometheus or replace with alternative (#40950), note that there's already a cron job for expiration, separate from the NRPE check, in /etc/cron.d/tor-backup-postgres, deployed by puppet as completed
- marked the checklist item port dsa-check-backuppg to prometheus or replace with alternative (#40950), note that there's already a cron job for expiration, separate from the NRPE check, in /etc/cron.d/tor-backup-postgres, deployed by puppet as incomplete
- anarcat marked the checklist item adapt dsa-check-backuppg to output Prometheus metrics, as a stopgap measure as completed
- anarcat changed the description
grep for /usr/lib/nagios/plugins or dsa[-_] in all cron jobs and in all our source code
so, quick smoke check:
anarcat@angela:~$ cumin-all "grep -r -e /usr/lib/nagios/plugins -e 'dsa[-_]' /etc/cron* /var/spool/crontabs"
91 hosts will be targeted:
alberti.torproject.org,anonticket-01.torproject.org,archive-01.torproject.org,backup-storage-01.torproject.org,bacula-director-01.torproject.org,btcpayserver-02.torproject.org,bungei.torproject.org,carinatum.torproject.org,cdn-backend-sunet-02.torproject.org,check-01.torproject.org,chives.torproject.org,ci-runner-x86-[02-03].torproject.org,colchicifolium.torproject.org,collector-02.torproject.org,crm-ext-01.torproject.org,crm-int-01.torproject.org,dal-node-[01-03].torproject.org,dal-rescue-[01-02].torproject.org,dangerzone-01.torproject.org,donate-01.torproject.org,donate-review-01.torproject.org,eugeni.torproject.org,forum-01.torproject.org,fsn-node-[01-08].torproject.org,gayi.torproject.org,gitlab-02.torproject.org,henryi.torproject.org,hetzner-hel1-[01-03].torproject.org,hetzner-nbg1-[01-02].torproject.org,idle-fsn-01.torproject.org,loghost01.torproject.org,mandos-01.torproject.org,materculae.torproject.org,media-01.torproject.org,meronense.torproject.org,metrics-store-01.torproject.org,metricsdb-01.torproject.org,minio-01.torproject.org,neriniflorum.torproject.org,nevii.torproject.org,ns[3,5].torproject.org,onionbalance-02.torproject.org,onionoo-backend-[01-03].torproject.org,onionoo-frontend-[01-02].torproject.org,palmeri.torproject.org,pauli.torproject.org,perdulce.torproject.org,polyanthum.torproject.org,probetelemetry-01.torproject.org,puppetdb-01.torproject.org,rdsys-frontend-01.torproject.org,rdsys-test-01.torproject.org,relay-01.torproject.org,rude.torproject.org,ssh-dal-01.torproject.org,static-gitlab-shim.torproject.org,static-master-fsn.torproject.org,staticiforme.torproject.org,submit-01.torproject.org,survey-01.torproject.org,tb-build-[02-03,06].torproject.org,tb-pkgstage-01.torproject.org,tb-tester-01.torproject.org,tbb-nightlies-master.torproject.org,telegram-bot-01.torproject.org,vault-01.torproject.org,weather-01.torproject.org,web-dal-[07-08].torproject.org,web-fsn-[01-02].torproject.org
OK to proceed on 91 hosts? Enter the number of affected hosts to confirm or "q" to quit: 91
===== NODE GROUP =====
(1) puppetdb-01.torproject.org
----- OUTPUT of 'grep -r -e /usr/...r/spool/crontabs' -----
bind [127.0.0.1]:8080: Address already in use
channel_setup_fwd_listener_tcpip: cannot listen to port: 8080
Could not request local forwarding.
/etc/cron.d/puppet-crontab:PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/nagios/plugins
/etc/cron.d/tor-nagios-checks:@hourly root [ -x /usr/sbin/dsa-update-apt-status ] && /usr/sbin/dsa-update-apt-status
/etc/cron.d/tor-nagios-checks:13 */4 * * * root [ -x /usr/sbin/dsa-update-samhain-status ] && /usr/sbin/dsa-update-samhain-status
grep: /var/spool/crontabs: No such file or directory
===== NODE GROUP =====
(50) anonticket-01.torproject.org,archive-01.torproject.org,backup-storage-01.torproject.org,carinatum.torproject.org,cdn-backend-sunet-02.torproject.org,chives.torproject.org,ci-runner-x86-03.torproject.org,colchicifolium.torproject.org,collector-02.torproject.org,dal-node-01.torproject.org,dal-rescue-02.torproject.org,dangerzone-01.torproject.org,donate-01.torproject.org,donate-review-01.torproject.org,eugeni.torproject.org,forum-01.torproject.org,fsn-node-[03,06].torproject.org,gayi.torproject.org,gitlab-02.torproject.org,henryi.torproject.org,hetzner-hel1-01.torproject.org,hetzner-nbg1-01.torproject.org,idle-fsn-01.torproject.org,materculae.torproject.org,meronense.torproject.org,metrics-store-01.torproject.org,metricsdb-01.torproject.org,minio-01.torproject.org,neriniflorum.torproject.org,nevii.torproject.org,ns[3,5].torproject.org,onionoo-backend-[01-03].torproject.org,onionoo-frontend-02.torproject.org,polyanthum.torproject.org,probetelemetry-01.torproject.org,rdsys-frontend-01.torproject.org,relay-01.torproject.org,ssh-dal-01.torproject.org,static-master-fsn.torproject.org,staticiforme.torproject.org,tb-build-[02,06].torproject.org,tb-tester-01.torproject.org,web-dal-08.torproject.org,web-fsn-[01-02].torproject.org
----- OUTPUT of 'grep -r -e /usr/...r/spool/crontabs' -----
/etc/cron.d/puppet-crontab:PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/nagios/plugins
/etc/cron.d/tor-nagios-checks:@hourly root [ -x /usr/sbin/dsa-update-apt-status ] && /usr/sbin/dsa-update-apt-status
/etc/cron.d/tor-nagios-checks:13 */4 * * * root [ -x /usr/sbin/dsa-update-samhain-status ] && /usr/sbin/dsa-update-samhain-status
grep: /var/spool/crontabs: No such file or directory
===== NODE GROUP =====
(1) bacula-director-01.torproject.org
----- OUTPUT of 'grep -r -e /usr/...r/spool/crontabs' -----
/etc/cron.d/puppet-crontab:PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/nagios/plugins
/etc/cron.d/puppet-crontab:*/3 * * * * root sleep $(( $RANDOM \% 60 )); flock -w 0 -e /usr/local/sbin/dsa-bacula-scheduler /usr/local/sbin/dsa-bacula-scheduler
/etc/cron.d/tor-nagios-checks:@hourly root [ -x /usr/sbin/dsa-update-apt-status ] && /usr/sbin/dsa-update-apt-status
/etc/cron.d/tor-nagios-checks:13 */4 * * * root [ -x /usr/sbin/dsa-update-samhain-status ] && /usr/sbin/dsa-update-samhain-status
grep: /var/spool/crontabs: No such file or directory
===== NODE GROUP =====
(1) loghost01.torproject.org
----- OUTPUT of 'grep -r -e /usr/...r/spool/crontabs' -----
/etc/cron.d/tor-nagios-checks:@hourly root [ -x /usr/sbin/dsa-update-apt-status ] && /usr/sbin/dsa-update-apt-status
/etc/cron.d/tor-nagios-checks:13 */4 * * * root [ -x /usr/sbin/dsa-update-samhain-status ] && /usr/sbin/dsa-update-samhain-status
/etc/cron.d/puppet-crontab:PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/nagios/plugins
/etc/cron.daily/puppet-handle-loghost-logs:# This file is maintained in dsa-puppet
grep: /var/spool/crontabs: No such file or directory
===== NODE GROUP =====
(1) bungei.torproject.org
----- OUTPUT of 'grep -r -e /usr/...r/spool/crontabs' -----
/etc/cron.d/tor-nagios-checks:@hourly root [ -x /usr/sbin/dsa-update-apt-status ] && /usr/sbin/dsa-update-apt-status
/etc/cron.d/tor-nagios-checks:13 */4 * * * root [ -x /usr/sbin/dsa-update-samhain-status ] && /usr/sbin/dsa-update-samhain-status
/etc/cron.d/puppet-crontab:PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/nagios/plugins
/etc/cron.d/tor-backup-postgres:20 2 * * 1 torbackup chronic /usr/lib/nagios/plugins/dsa-check-backuppg -e
grep: /var/spool/crontabs: No such file or directory
===== NODE GROUP =====
(37) alberti.torproject.org,btcpayserver-02.torproject.org,check-01.torproject.org,ci-runner-x86-02.torproject.org,crm-ext-01.torproject.org,crm-int-01.torproject.org,dal-node-[02-03].torproject.org,dal-rescue-01.torproject.org,fsn-node-[01-02,04-05,07-08].torproject.org,hetzner-hel1-[02-03].torproject.org,hetzner-nbg1-02.torproject.org,mandos-01.torproject.org,media-01.torproject.org,onionbalance-02.torproject.org,onionoo-frontend-01.torproject.org,palmeri.torproject.org,pauli.torproject.org,perdulce.torproject.org,rdsys-test-01.torproject.org,rude.torproject.org,static-gitlab-shim.torproject.org,submit-01.torproject.org,survey-01.torproject.org,tb-build-03.torproject.org,tb-pkgstage-01.torproject.org,tbb-nightlies-master.torproject.org,telegram-bot-01.torproject.org,vault-01.torproject.org,weather-01.torproject.org,web-dal-07.torproject.org
----- OUTPUT of 'grep -r -e /usr/...r/spool/crontabs' -----
/etc/cron.d/tor-nagios-checks:@hourly root [ -x /usr/sbin/dsa-update-apt-status ] && /usr/sbin/dsa-update-apt-status
/etc/cron.d/tor-nagios-checks:13 */4 * * * root [ -x /usr/sbin/dsa-update-samhain-status ] && /usr/sbin/dsa-update-samhain-status
/etc/cron.d/puppet-crontab:PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/nagios/plugins
grep: /var/spool/crontabs: No such file or directory
================
PASS | | 0% (0/91) [00:28<?, ?hosts/s]
FAIL |████████████████████████████████████████| 100% (91/91) [00:28<00:00, 3.22hosts/s]
100.0% (91/91) of nodes failed to execute command 'grep -r -e /usr/...r/spool/crontabs': alberti.torproject.org,anonticket-01.torproject.org,archive-01.torproject.org,backup-storage-01.torproject.org,bacula-director-01.torproject.org,btcpayserver-02.torproject.org,bungei.torproject.org,carinatum.torproject.org,cdn-backend-sunet-02.torproject.org,check-01.torproject.org,chives.torproject.org,ci-runner-x86-[02-03].torproject.org,colchicifolium.torproject.org,collector-02.torproject.org,crm-ext-01.torproject.org,crm-int-01.torproject.org,dal-node-[01-03].torproject.org,dal-rescue-[01-02].torproject.org,dangerzone-01.torproject.org,donate-01.torproject.org,donate-review-01.torproject.org,eugeni.torproject.org,forum-01.torproject.org,fsn-node-[01-08].torproject.org,gayi.torproject.org,gitlab-02.torproject.org,henryi.torproject.org,hetzner-hel1-[01-03].torproject.org,hetzner-nbg1-[01-02].torproject.org,idle-fsn-01.torproject.org,loghost01.torproject.org,mandos-01.torproject.org,materculae.torproject.org,media-01.torproject.org,meronense.torproject.org,metrics-store-01.torproject.org,metricsdb-01.torproject.org,minio-01.torproject.org,neriniflorum.torproject.org,nevii.torproject.org,ns[3,5].torproject.org,onionbalance-02.torproject.org,onionoo-backend-[01-03].torproject.org,onionoo-frontend-[01-02].torproject.org,palmeri.torproject.org,pauli.torproject.org,perdulce.torproject.org,polyanthum.torproject.org,probetelemetry-01.torproject.org,puppetdb-01.torproject.org,rdsys-frontend-01.torproject.org,rdsys-test-01.torproject.org,relay-01.torproject.org,rude.torproject.org,ssh-dal-01.torproject.org,static-gitlab-shim.torproject.org,static-master-fsn.torproject.org,staticiforme.torproject.org,submit-01.torproject.org,survey-01.torproject.org,tb-build-[02-03,06].torproject.org,tb-pkgstage-01.torproject.org,tb-tester-01.torproject.org,tbb-nightlies-master.torproject.org,telegram-bot-01.torproject.org,vault-01.torproject.org,weather-01.torproject.org,web-dal-[07-08].torproject.org,web-fsn-[01-02].torproject.org
so the big thing in there is puppet-crontab, which has:
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/nagios/plugins
which is really kind of despicable because now we have to check for all paths that could be in /usr/lib/nagios/plugins in our cron jobs anywhere, urgh.
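Because that directory is on PATH, a cron job can call a plugin by bare name and never match a grep for the full path. One way to look for those is sketched below, assuming the plugin basenames come from the dpkg -L listing and the cron lines have already been read from the hosts; the function name and the sample cron line with a bare invocation are hypothetical.

```python
import re


def find_bare_invocations(cron_lines, plugin_names):
    """Return (plugin, line) pairs where a plugin is called without a path.

    A name counts as bare when it is not preceded by '/', a word character,
    '.' or '-', and is not followed by a character that would make it part
    of a longer name.
    """
    patterns = {
        name: re.compile(r"(?<![\w/.-])" + re.escape(name) + r"(?![\w.-])")
        for name in plugin_names
    }
    hits = []
    for line in cron_lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        for name, pattern in patterns.items():
            if pattern.search(stripped):
                hits.append((name, stripped))
    return hits


cron_lines = [
    "20 2 * * 1 torbackup chronic /usr/lib/nagios/plugins/dsa-check-backuppg -e",
    "@hourly root dsa-check-packages",  # hypothetical bare invocation
    "# dsa-check-packages",
]
plugins = ["dsa-check-backuppg", "dsa-check-packages"]
print(find_bare_invocations(cron_lines, plugins))
```

Full-path invocations are deliberately ignored here since the plain grep already finds them.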
otherwise, we find, as expected:
- bungei: /etc/cron.d/tor-backup-postgres:20 2 * * 1 torbackup chronic /usr/lib/nagios/plugins/dsa-check-backuppg -e, to keep
- nevii: /var/spool/cron/crontabs/dnsadm:14 */4 * * * chronic bin/dsa-check-dnssec-coverage-all-nagios-wrap and /var/spool/cron/crontabs/dnsadm:24 */4 * * * chronic bin/dsa-check-and-extend-DS, to keep
- all hosts: /etc/cron.d/tor-nagios-checks:@hourly root [ -x /usr/sbin/dsa-update-apt-status ] && /usr/sbin/dsa-update-apt-status (safe to drop)
that tor-nagios-checks file can be safely deleted on all hosts, and currently consists of:
@hourly root [ -x /usr/sbin/dsa-update-apt-status ] && /usr/sbin/dsa-update-apt-status
13 */4 * * * root [ -x /usr/sbin/dsa-update-samhain-status ] && /usr/sbin/dsa-update-samhain-status
that would be done as part of the tor-nagios-checks retirement.
next step is to actually check for side-effects in NRPE checks.
- anarcat marked the checklist item grep for /usr/lib/nagios/plugins or dsa[-_] in all cron jobs and in all our source code as completed
- anarcat changed the description
- anarcat marked the checklist item audit all active checks for side effects as completed
- anarcat mentioned in commit fabric-tasks@b00892b5
- anarcat mentioned in commit wiki-replica@2611a872
- anarcat marked this issue as related to prometheus-alerts#16
- anarcat changed the description
- Resolved by anarcat
in #40695 (closed) i did a big cleanup of the puppet codebase, removing all the NRPE/nagios stuff. the only thing remaining is the tor-nagios-checks package install, which I made configurable through a class parameter. i propose setting it up to "purge" everywhere but the two hosts, then flip the switch so that only those two hosts include the class.
- anarcat marked the checklist item remove tor-nagios-checks from all servers apart from exceptions, which are bungei and nevii as completed
- anarcat changed the description
- anarcat marked the checklist item archive tor-nagios-checks repository as completed
- anarcat changed the description
- anarcat changed the description
- anarcat marked the checklist item check_ntp_time: unclear how that differs from check_ntp_peer, followup in #41639 (closed) as completed
- anarcat added Roadmap::Future label and removed Doing label
- anarcat mentioned in issue #41639 (closed)
- anarcat changed title from replace remaining tor-nagios-checks diagnostic tools with fabric to retire tor-nagios-checks package
- anarcat changed the description
- anarcat removed the relation with prometheus-alerts#16
- anarcat marked this issue as related to #41816 (closed)
- anarcat changed the description
- anarcat mentioned in commit repos@8a26c0ec
- anarcat changed the description
- anarcat mentioned in issue #41816 (closed)