retire hetzner-hel1-01 (nagios/icinga)
Nagios is going to be a particularly tricky bullseye upgrade, so it's not part of the large bullseye upgrade batches (#40690 (closed) or #40692 (closed)).
We need to decide whether we keep icinga around at all or replace it with Prometheus (#29864 (closed)). If we do keep icinga, we need to decide whether we keep the current "push to git to rebuild the config" model or "puppetize the setup" (#32901 (closed)). Update: we decided to retire Icinga, see TPA-RFC-33.
-
announcement: TPA-RFC-33 should cover most needs, but a week before the server shutdown, do a reminder, particularly for network-health if they still have checks in there -
compare nagios alerts with prometheus to make sure we are not missing anything (see also #41713 (closed) for an audit), make sure the person in rotation keeps an eye on both dashboards for a couple of weeks: we'll compare history instead, spun off in #41791 (closed) -
nagios: (!!) this step can be removed from the docs! will be automated in the new Prometheus world, whoohoo! -
replace "active" NRPE checks (those with side effects) with cron jobs, see also #41671, to check: -
dsa-check-backuppg, double-checked in #41774 (closed), requires tor-nagios-checks installed on bungei, to keep until replacement (#40950) -
dsa-check-bacula, double-checked, no side effects -
dsa-check-cert-expire, double-checked, no side effects -
dsa-check-mirrorsync, double-checked, no side effects -
dsa-check-udldap-freshness, double-checked, no side effects -
dsa-check-filesystems, double-checked, no side effects -
dsa-check-unbound-anchors, double-checked, no side effects -
dsa_check_soas_add
-
dsa-check-zone-rrsig-expiration-many
-
dsa-check-zone-signature-all
-
dsa-check-dnssec-delegation
-
dsa-check-statusfile
-
others, extract list from list in #41671
-
-
retire the host in fabric, do NOT destroy the VM before an extensive delay, say 30 days, keep backups for longer as well, say a year -
remove from LDAP with ldapvi
-
power-grep (partial, see #41816 for followup) -
remove from tor-passwords -
remove from DNSwl -
remove from docs: this bit will be particularly tricky as we probably reference nagios/icinga everywhere, see also #41655 (partial, see #41816 for followup) -
remove from racks (moved to a separate ticket, see #41817 (closed)) -
remove from reverse DNS: N/A, will go away with the server's destruction in #41817 (closed) -
flush tor-nagios-checks from db.torproject.org once the tasks have been rewritten in fabric, see also https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-33-monitoring#pager-playbook-responses - should probably be moved to its own issue, as this might be okay in phase B; moved to #41671 -
consider opening another ticket to replace mininag with prometheus once we reach high availability: done, see #41670 (closed) - extra cruft found:
-
redirect #tor-nagios to tor-alerts on irc -
cleanup NRPE stuff from puppet -
archive tor-nagios.git -
nagios.tpo TLS certs (includes dehydrated?)
-
Edited by anarcat
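
For the "replace active NRPE checks with cron jobs" step, here is a rough sketch of what a cron wrapper could look like. This is only an illustration, not what was deployed: the ticket does not say what the cron jobs should do with the check result, so writing it out in the Prometheus text format (for example, for node exporter's textfile collector) is an assumption, and the metric name `nagios_check_status` is made up. The only part taken as given is the usual Nagios plugin exit code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import os
import subprocess
import tempfile

# Usual Nagios plugin exit codes.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(argv, timeout=60):
    """Run a Nagios-style check and return (exit_code, first_output_line)."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        # Anything outside the 0-3 range is folded into UNKNOWN.
        code = proc.returncode if proc.returncode in NAGIOS_STATES else 3
        lines = proc.stdout.strip().splitlines()
        output = lines[0] if lines else ""
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A check that cannot start or that hangs is reported as UNKNOWN.
        code, output = 3, str(exc)
    return code, output


def format_metric(check_name, code):
    """Render one gauge sample in the Prometheus text exposition format.

    Assumes check_name is a plain identifier (no quotes or backslashes).
    "nagios_check_status" is a hypothetical metric name.
    """
    return (
        "# HELP nagios_check_status Exit code of a Nagios-style check (0=OK).\n"
        "# TYPE nagios_check_status gauge\n"
        f'nagios_check_status{{check="{check_name}"}} {code}\n'
    )


def write_atomically(path, content):
    """Write to a temporary file, then rename, so readers never see a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.chmod(tmp, 0o644)  # mkstemp creates the file as 0600
    os.replace(tmp, path)
```

A cron job would then call `run_check()` with the check's command line, pass the exit code to `format_metric()`, and hand the result to `write_atomically()` with a path in whatever directory the metrics collector reads (that path is site-specific and not specified here).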