TPA-RFC-67: retire mininag

marked this issue as related to #41633 (closed)

changed milestone to %TPA-RFC-33-A: emergency Icinga retirement

assigned to @anarcat

added Next label and removed Backlog label

added Doing label and removed Next label

in wiki-replica@bce4ed00, i've analyzed how mini-nag actually works. in general, we might be able to get away with keeping some Nagios plugins (monitoring-plugins-basic package, specifically) installed on the main Nagios server except for the shutdown check, which is done over NRPE to all affected hosts.

Given the way the script works, we could also completely rewrite it to replace it with a check that taps into Prometheus metrics instead. The script is short and has a fairly well defined interface (namely: the status directory).

Failing that, we could also simply stop using the mininag script altogether and just accept that mirrors have outages like any other, and instead rely on web browser behaviors to fall back on secondary servers when one server in rotation doesn't answer properly.

Will think this over.

mentioned in commit wiki-replica@bce4ed00

added Needs Review label and removed Doing label

so it seems we can't have nice things and we'll have to port at least parts of mini-nag to our brave new world.

i think the simplest solution would be to keep the monitoring plugins package on nevii, and then add another check that would ping prometheus. it would require a one-line patch to mini-nag to add the check, then writing the actual check that would:

fetch the status from a prometheus server, with fallback (ironically, probably implementing at least parts of RFC8305 once we have HA)
if the prometheus server is unavailable, return success
if it is available, return success if the host is not pending a reboot
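a rough sketch of what such a check could look like, assuming the metric ends up being called node_shutdown_scheduled_timestamp and carries an `instance` label matching the host. the function names, the label, and the server list are all hypothetical. only the `/api/v1/query` endpoint is the regular Prometheus HTTP API:

```python
import json
import urllib.parse
import urllib.request

OK, CRITICAL = 0, 2  # Nagios plugin exit codes

# Assumption: metric name taken from the proposal; nothing exports it yet.
METRIC = "node_shutdown_scheduled_timestamp"


def query_prometheus(server, promql, timeout=5):
    """Run an instant query against the Prometheus HTTP API, return the result list."""
    url = f"{server}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    if payload.get("status") != "success":
        raise ValueError(f"query failed on {server}")
    return payload["data"]["result"]


def check_pending_reboot(host, servers, query=query_prometheus):
    """Return OK unless a reachable Prometheus server reports a pending reboot."""
    # Assumption: the host is identified by an "instance" label.
    promql = f'{METRIC}{{instance="{host}"}} > 0'
    for server in servers:
        try:
            result = query(server, promql)
        except (OSError, ValueError, KeyError):
            continue  # this server is unavailable, fall back to the next one
        # A non-empty result means a shutdown is scheduled for this host.
        return CRITICAL if result else OK
    return OK  # no server answered: fail open, as described above
```

the `query` parameter is only there so the logic can be exercised without a live server. the sequential loop is a plain fallback, not a real RFC8305 implementation.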

finally, we'll need something that writes a metric for prometheus to scrape when there is indeed a pending reboot. possibly a node_shutdown_scheduled_timestamp metric that would be added by mollyguard and reset on reboot.
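for illustration, a minimal sketch of how such a metric could be written for the node exporter's textfile collector, which reads `*.prom` files from the directory given by `--collector.textfile.directory`. the file name `shutdown.prom`, the help text, and the convention of writing 0 to reset are assumptions, and this is not the script referenced in the upstream issue:

```python
import os
import tempfile

# Assumption: metric name taken from the proposal, not an existing metric.
METRIC = "node_shutdown_scheduled_timestamp"


def render_metric(timestamp):
    """Render the metric in the Prometheus text exposition format."""
    return (
        f"# HELP {METRIC} Unix time of the scheduled shutdown, 0 if none.\n"
        f"# TYPE {METRIC} gauge\n"
        f"{METRIC} {int(timestamp)}\n"
    )


def write_metric(directory, timestamp):
    """Atomically write the metric where the textfile collector can pick it up."""
    # Assumption: "shutdown.prom" is an arbitrary file name chosen for this sketch.
    target = os.path.join(directory, "shutdown.prom")
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(render_metric(timestamp))
    os.chmod(tmp, 0o644)  # mkstemp creates the file with mode 0600
    os.replace(tmp, target)  # rename so the exporter never reads a partial file
    return target
```

calling `write_metric(directory, 0)` at boot would be one way to reset the value. the open question remains what would call this when a shutdown is actually scheduled.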

i tried doing this through molly-guard, but it doesn't work because molly-guard runs before the shutdown command is issued (not after), so it doesn't know about the reboot time yet.

i filed https://github.com/prometheus/node_exporter/issues/3110 against the node exporter to see if it could be done there.

in there, you can see the script i wrote to export metrics when called, but there's nowhere to call it. i'm trying to use dbus activation to run it now.

added Doing label and removed Needs Review label

after discussing this on IRC, i think we're converging towards a mini-nag retirement for now.

we'll have to hack at this through fabric right now anyway: even the skilled prometheus folks couldn't figure out how to patch the node exporter easily. it seems the upstream wrapper around dbus needs to be patched for this to work properly (https://github.com/coreos/go-systemd/issues/447#issuecomment-2332011113).

so for now the plan is:

disable the mini-nag cron job
hack at fabric to drop "status" files in auto-dns and refresh DNS when rotating mirrors (or just ignore this entirely)
possibly, eventually, rewrite mininag to poll the prometheus server for availability

The impact of this is that users would see outages while visiting torproject.org during reboots, possibly lasting a few minutes. We're assuming users already see such outages because our reboot procedures are not actually diligent enough at following timeouts to make this work...

So basically, we need another issue here to fix our reboot procedures in whatever way, and another issue to restore the self-healing system in the brave new non-NRPE world.

marked this issue as related to #40695 (closed)

mentioned in commit wiki-replica@fdd777ea

proposed this policy change in TPA-RFC-67, waiting for objections until wednesday.

changed due date to September 11, 2024

added Needs Review label

removed Doing label

changed title from port mininag to prometheus or retire to TPA-RFC-67: port mininag to prometheus or retire

changed the description

marked the checklist item evaluate possibility of replacing or rewriting mininag (not practical) as completed

marked the checklist item propose retirement (https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-67-retire-mini-nag) as completed

marked this issue as related to #41750 (closed)

added RFC label

added Next label and removed Needs Review label

changed title from TPA-RFC-67: port mininag to prometheus or retire to TPA-RFC-67: retire mininag

marked the checklist item wait for approval as completed

changed due date to September 17, 2024

got swamped by donate-neo this week, will revisit next.

added Doing label and removed Next label

marked the checklist item disable cron job as completed

disabled cron job in the dnsadm user:

# run mini-nag every two minutes to catch down servers
#*/2 * * * * chronic bin/update-mini-nag

mentioned in commit fabric-tasks@895d18f9

mentioned in merge request fabric-tasks!3 (closed)

marked this issue as related to #41766 (closed)

mentioned in issue #41766 (closed)

i'm going to take out the "improve reboots" task in another issue. we're swamped, and i just don't see this happening with an easy fix in the short term, see #41766 (closed).

we're done here, mininag is retired, and we have a plan to improve availability going forward (#41766 (closed), #41670 (closed)).

closed

mentioned in commit wiki-replica@df400a08

mentioned in commit wiki-replica@1ed555f9

mentioned in issue #41811 (closed)

marked this issue as related to #41811 (closed)

marked this issue as related to #41670 (closed)
