retire hetzner-hel1-01 (nagios/icinga)

Nagios is going to be a particularly tricky bullseye upgrade, so it's not part of the large bullseye upgrade batches (#40690 (closed) or #40692 (closed)).

We need to decide whether we keep icinga around at all or replace it with Prometheus (#29864 (closed)). If we do keep icinga, we need to decide whether we keep the current "push to git to rebuild the config" model or "puppetize the setup" (#32901 (closed)). We decided to retire Icinga, see TPA-RFC-33.

  • announcement: TPA-RFC-33 should cover most needs, but a week before the server shutdown, send a reminder, particularly to network-health if they still have checks in there
  • compare nagios alerts with prometheus to make sure we are not missing anything (see also #41713 (closed) for an audit), make sure the person in rotation keeps an eye on both dashboards for a couple of weeks: we'll compare history instead, spun off in #41791 (closed)
  • nagios: (!!) this step can be removed from the docs! It will be automated in the new Prometheus world, whoohoo!
  • replace "active" NRPE checks (those with side effects) with cron jobs, see also #41671, to check:
    • dsa-check-backuppg, double-checked in #41774 (closed), requires tor-nagios-checks installed on bungei, to keep until replacement (#40950 (closed))
    • dsa-check-bacula, double-checked, no side effects
    • dsa-check-cert-expire, double-checked, no side effects
    • dsa-check-mirrorsync, double-checked, no side effects
    • dsa-check-udldap-freshness, double-checked, no side effects
    • dsa-check-filesystems, double-checked, no side effects
    • dsa-check-unbound-anchors, double-checked, no side effects
    • dsa_check_soas_add
    • dsa-check-zone-rrsig-expiration-many
    • dsa-check-zone-signature-all
    • dsa-check-dnssec-delegation
    • dsa-check-statusfile
    • others: extract the list from #41671
  • retire the host in fabric, do NOT destroy the VM before an extensive delay, say 30 days, keep backups for longer as well, say a year
  • remove from LDAP with ldapvi
  • power-grep (partial, see #41816 (closed) for followup)
  • remove from tor-passwords
  • remove from DNSwl
  • remove from docs: this bit will be particularly tricky as we probably reference nagios/icinga everywhere, see also #41655 (closed) (partial, see #41816 (closed) for followup)
  • remove from racks (moved to a separate ticket, see #41817 (closed))
  • remove from reverse DNS: N/A, will go away with the server's destruction in #41817 (closed)
  • flush tor-nagios-checks from db.torproject.org once the tasks have been rewritten in fabric, see also https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-33-monitoring#pager-playbook-responses - should probably be moved to its own issue, as this might be okay in phase B: moved to #41671
  • consider opening another ticket to replace mininag with prometheus once we reach high availability: done, see #41670 (closed)
  • extra cruft found:
    • redirect #tor-nagios to tor-alerts on irc
    • cleanup NRPE stuff from puppet
    • archive tor-nagios.git
    • nagios.tpo TLS certs (includes dehydrated?)
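
For the "replace active NRPE checks with cron jobs" item above, here is a minimal sketch of one way a cron job could run a Nagios-style check and expose its result to Prometheus through the node exporter's textfile collector. This is an illustration only, not the replacement tracked in #41671: the metric name `nagios_check_status`, the function names, and any paths are hypothetical. It relies only on the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and on the textfile collector reading `*.prom` files.

```python
import os
import subprocess
import tempfile
from typing import Sequence, Tuple

# Standard Nagios plugin exit codes.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(command: Sequence[str], timeout: float = 60.0) -> Tuple[int, str]:
    """Run a Nagios-style plugin; return (exit code, first line of output)."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A check that cannot run or hangs is reported as UNKNOWN.
        return 3, str(exc)
    code = proc.returncode if proc.returncode in NAGIOS_STATES else 3
    lines = proc.stdout.strip().splitlines()
    return code, lines[0] if lines else ""


def render_metric(check_name: str, code: int) -> str:
    """Render the exit code in the Prometheus text exposition format.

    The metric name is hypothetical, chosen for this sketch.
    """
    return (
        "# HELP nagios_check_status Exit code of a Nagios-style check "
        "(0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN).\n"
        "# TYPE nagios_check_status gauge\n"
        f'nagios_check_status{{check="{check_name}"}} {code}\n'
    )


def write_metric(path: str, text: str) -> None:
    """Write the file atomically so a scrape never sees a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; the exporter must read it
    os.replace(tmp, path)
```

A cron job could then call `run_check` with the plugin's command line, pass the returned code to `render_metric`, and hand the result to `write_metric` with a `.prom` path inside whatever directory the node exporter's textfile collector is configured to read. Checks with side effects, like dsa-check-backuppg, would still need their side effects preserved by whatever replaces them; this sketch only covers reporting the status.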
Edited by anarcat