in wiki-replica@bce4ed00, i've analyzed how mini-nag actually works. in general, we might be able to get away with keeping some Nagios plugins (monitoring-plugins-basic package, specifically) installed on the main Nagios server except for the shutdown check, which is done over NRPE to all affected hosts.
Given the way the script works, we could also completely rewrite it to replace it with a check that taps into Prometheus metrics instead. The script is short and has a fairly well defined interface (namely: the status directory).
Failing that, we could also simply stop using the mininag script altogether and just accept that mirrors have outages like any other, and instead rely on web browser behaviors to fall back on secondary servers when one server in rotation doesn't answer properly.
so it seems we can't have nice things and we'll have to port at least parts of mini-nag to our brave new world.
i think the simplest solution would be to keep the monitoring plugins package on nevii, and then add another check that would ping prometheus. it would require a one-line patch to mini-nag to add the check, then writing the actual check that would:
fetch the status from a prometheus server, with fallback (ironically, probably implementing at least parts of RFC8305 once we have HA)
if the prometheus server is unavailable, return success
if it is available, return success if the host is not pending a reboot
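The check logic above can be sketched roughly as follows. This is only an illustration, not the actual check: the `instance` label matching, the server list, and the function names are assumptions, and `node_shutdown_scheduled_timestamp` is the metric proposed in this ticket, not something the node exporter ships. The only real interface relied on is the Prometheus HTTP query API (`/api/v1/query`).

```python
import json
import urllib.parse
import urllib.request

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes

# Assumption: the metric name proposed in this ticket.
METRIC = "node_shutdown_scheduled_timestamp"


def query_prometheus(base_url, promql, timeout=5):
    """Run an instant query against the Prometheus HTTP API, return parsed JSON."""
    url = base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode(
        {"query": promql}
    )
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def check_pending_reboot(host, servers, query=query_prometheus):
    """Return (exit_code, message) for `host`, trying each server in order."""
    # Assumption: hosts are identified by an `instance` label of the form
    # "hostname" or "hostname:port". The dots in the hostname are left
    # unescaped, which is sloppy but harmless for a sketch.
    promql = '%s{instance=~"%s(:[0-9]+)?"} > 0' % (METRIC, host)
    for server in servers:
        try:
            reply = query(server, promql)
        except (OSError, ValueError):
            continue  # unreachable or garbage reply: fall back to the next server
        if reply.get("status") != "success":
            continue
        if reply["data"]["result"]:
            return CRITICAL, "%s has a shutdown scheduled" % host
        return OK, "%s has no shutdown scheduled" % host
    # No Prometheus server answered: succeed, so that a monitoring outage
    # does not pull every mirror out of rotation.
    return OK, "no prometheus server reachable, assuming %s is fine" % host
```

A wrapper script would call `check_pending_reboot()` with the host name and the list of Prometheus URLs, print the message, and exit with the returned code, which is what a Nagios-style caller expects. The serial loop over servers is a much cruder fallback than RFC8305.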
finally, we'll need something that writes a metric for prometheus to scrape when there is indeed a pending reboot. possibly a node_shutdown_scheduled_timestamp metric that would be added by molly-guard and reset on reboot.
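As a rough sketch of what writing such a metric could look like, the following renders `node_shutdown_scheduled_timestamp` in the node exporter textfile collector format. It rests on several assumptions: that systemd records a scheduled shutdown in `/run/systemd/shutdown/scheduled` as `KEY=value` lines with a `USEC=` timestamp in microseconds, that the textfile collector reads `/var/lib/prometheus/node-exporter`, and the `shutdown.prom` file name is made up. What should trigger the script is left open here.

```python
import os
import tempfile

# Assumption: the metric name proposed in this ticket.
METRIC = "node_shutdown_scheduled_timestamp"
# Assumption: where systemd records a scheduled shutdown.
SCHEDULED_FILE = "/run/systemd/shutdown/scheduled"
# Assumption: the textfile collector directory on this host.
TEXTFILE_DIR = "/var/lib/prometheus/node-exporter"


def scheduled_timestamp(text):
    """Return the scheduled shutdown time in epoch seconds, or 0 if none."""
    for line in text.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "USEC" and value.strip().isdigit():
            return int(value.strip()) // 1_000_000  # microseconds to seconds
    return 0


def render_metric(text):
    """Render the metric in the Prometheus text exposition format."""
    return (
        "# HELP %s Scheduled shutdown time in seconds since the epoch, 0 if none.\n"
        "# TYPE %s gauge\n"
        "%s %d\n" % (METRIC, METRIC, METRIC, scheduled_timestamp(text))
    )


def write_textfile(content, directory=TEXTFILE_DIR, name="shutdown.prom"):
    """Write the file atomically so the exporter never reads a partial file."""
    with tempfile.NamedTemporaryFile("w", dir=directory, delete=False) as tmp:
        tmp.write(content)
    os.chmod(tmp.name, 0o644)
    target = os.path.join(directory, name)
    os.replace(tmp.name, target)
    return target


def export(scheduled_file=SCHEDULED_FILE, directory=TEXTFILE_DIR):
    """Read the schedule file (if any) and publish the metric."""
    try:
        with open(scheduled_file) as handle:
            text = handle.read()
    except FileNotFoundError:
        text = ""  # nothing scheduled
    return write_textfile(render_metric(text), directory)
```

Calling `export()` with no scheduled shutdown writes a value of `0`. To get the "reset on reboot" behavior, it would also have to run once at boot, since the `.prom` file lives outside `/run` and survives the reboot.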
i tried doing this through molly-guard, but it doesn't work because molly-guard runs before the shutdown command is issued (not after), so it doesn't know about the reboot time yet.
in there, you can see the script i wrote to export metrics when called, but there's nowhere to call it. i'm trying to use dbus activation to run it now.
after discussing this on IRC, i think we're converging towards a mini-nag retirement for now.
we'll have to hack at this through fabric right now anyways: even the skilled prometheus folks couldn't figure out how to patch the node exporter easily. it seems like the upstream wrapper around dbus needs to be patched for this to work properly (https://github.com/coreos/go-systemd/issues/447#issuecomment-2332011113).
so for now the plan is:
disable the mini-nag cron job
hack at fabric to drop "status" files in auto-dns and refresh DNS when rotating mirrors (or just ignore this entirely)
possibly, eventually, rewrite mininag to poll the prometheus server for availability
The impact of this is that users would see outages while visiting torproject.org during reboots, possibly lasting a few minutes. We're assuming users already see such outages because our reboot procedures are not actually diligent enough at following timeouts to make this work...
So basically, we need another issue here to fix our reboot procedures in whatever way, and another issue to restore the self-healing system in the brave new non-NRPE world.
i'm going to take out the "improve reboots" task in another issue. we're swamped, and i just don't see this happening with an easy fix in the short term, see #41766 (closed).
we're done here, mininag is retired, and we have a plan to improve availability going forward (#41766 (closed), #41670 (closed)).