title: TPA-RFC-67: Retire mini-nag
costs: N/A
approval: TPA
affected users: global
deadline: 2024-09-11
status: obsolete
discussion: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41734
Summary: retire mini-nag, degradation in availability during unplanned outages expected
Background
mini-nag is a bespoke script that runs every two minutes on the primary DNS server. It probes the hosts backing the mirror system (defined in the auto-dns repository) to check if they are unavailable or pending a shutdown and, if so, takes them out of the DNS rotation.
To perform most checks, it uses plugins from the monitoring-plugins repository (essentially Nagios checks), run locally (e.g. check_ping, check_http). The exception is the shutdown check, which runs over NRPE.
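To make the mechanism concrete, here is a minimal sketch of such a probe loop. It is not mini-nag's actual code: the plugin paths (Debian's usual location), the thresholds, the function names and the "every check must return OK" policy are all assumptions made for illustration. Only the exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) is standard for Nagios-style plugins.

```python
import subprocess

# Nagios plugin exit-code convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
OK, UNKNOWN = 0, 3

# Hypothetical check definitions: each builds a plugin command line for a host.
# Paths and thresholds are assumptions, not taken from mini-nag.
CHECKS = [
    lambda host: ["/usr/lib/nagios/plugins/check_ping", "-H", host,
                  "-w", "100,20%", "-c", "500,60%"],
    lambda host: ["/usr/lib/nagios/plugins/check_http", "-H", host],
]


def run_check(argv, timeout=30):
    """Run one plugin and return its exit code (UNKNOWN if it cannot run)."""
    try:
        return subprocess.run(argv, capture_output=True, timeout=timeout).returncode
    except (OSError, subprocess.TimeoutExpired):
        return UNKNOWN


def hosts_in_rotation(hosts, checks=CHECKS, runner=run_check):
    """Return the hosts for which every check returned OK (assumed policy)."""
    return [
        host for host in hosts
        if all(runner(build(host)) == OK for build in checks)
    ]
```

The `runner` parameter exists only so the loop can be exercised without the plugins installed.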
NRPE is going to be fully retired as part of the Nagios retirement (tpo/tpa/team#40695) and this will break the shutdown checks.
In-depth static analysis of the script's code seems to indicate it might also be vulnerable to catastrophic failure in case of a partial network disturbance on the primary DNS server, which could knock all mirrors offline.
Note that mini-nag (and possibly Nagios as well) did not detect a critical outage (tpo/tpa/team#41672) until it was too late. The current coverage of this monitoring tool is therefore flawed, at best.
Proposal
Disable the mini-nag cron job on the primary DNS server (currently nevii) to keep it from taking hosts out of rotation altogether.
Optionally, modify the fabric-tasks reboot job to post a "flag file" in auto-dns to take hosts out of rotation while performing reboots.
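As a rough illustration of the optional flag file step, the sketch below wraps a reboot in a context manager that creates a flag before the reboot and removes it afterwards. The flag directory, the `.disabled` suffix, the file contents and the `out_of_rotation` name are all hypothetical: the actual format auto-dns expects, and how the file reaches the repository (commit, push or copy), are not specified here.

```python
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def out_of_rotation(flag_dir, host):
    """Create a per-host flag file for the duration of the block.

    The "<host>.disabled" naming is an assumption for illustration only.
    """
    flag = Path(flag_dir) / f"{host}.disabled"
    flag.parent.mkdir(parents=True, exist_ok=True)
    flag.write_text("rebooting\n")
    try:
        yield flag
    finally:
        # Always clear the flag, even if the reboot step raised an error.
        flag.unlink(missing_ok=True)
```

A reboot job would then run its reboot and wait-for-host steps inside `with out_of_rotation(flag_dir, host):`, so a failed reboot cannot leave a stale flag behind.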
This work will start next week, on Wednesday September 11th 2024, unless an objection is raised.
Impact
During unplanned outages, some mirrors might be unavailable to users, causing timeouts and connection errors that would require manual recovery by TPA.
During planned outages, if the optional fabric-tasks modification isn't performed, similar outages could occur for a couple of minutes while the hosts reboot.
Normally, RFC8305 ("Happy Eyeballs v2") should mitigate such situations, as it prescribes an improved algorithm for HTTP user agents to fall back through round robin DNS records during such outages. Unfortunately, our preliminary analysis seems to indicate low adoption of that standard, even in modern browsers, although the full extent of that support is still to be determined.
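For readers unfamiliar with the mechanism, this is roughly what a client that does implement it looks like. The sketch relies on Python's asyncio, whose `happy_eyeballs_delay` option staggers connection attempts across all resolved addresses in the manner of RFC 8305; the 250 ms value is the default delay that RFC recommends. The function name is hypothetical, and this says nothing about how browsers behave.

```python
import asyncio

# RFC 8305 recommends 250 ms as the default "Connection Attempt Delay".
CONNECTION_ATTEMPT_DELAY = 0.25


async def connect_with_fallback(host, port, delay=CONNECTION_ATTEMPT_DELAY):
    """Connect to whichever resolved address answers first.

    If one address is unreachable, the next attempt starts after `delay`
    seconds instead of waiting for a full TCP timeout.
    Returns the IP address that was actually used.
    """
    reader, writer = await asyncio.open_connection(
        host, port, happy_eyeballs_delay=delay
    )
    peer = writer.get_extra_info("peername")
    writer.close()
    await writer.wait_closed()
    return peer[0]
```

With such a client, a dead mirror in a round robin record costs a fraction of a second rather than a timeout, which is why adoption of the standard matters for this proposal.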
At the moment, our reboot procedures are not tuned well enough to mitigate such outages in the first place. Our DNS TTL is currently one hour, so we would need to wait at least that long after taking a host out of rotation to ensure a proper transition, something we're currently not doing anyway.
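The timing constraint can be spelled out with a small worked example. Only the one-hour TTL comes from the text; the function names and the example timestamps are illustrative. The sketch also assumes resolvers honor the TTL exactly.

```python
from datetime import datetime, timedelta, timezone

DNS_TTL = timedelta(hours=1)  # current TTL, per the text above


def earliest_safe_reboot(removed_at, ttl=DNS_TTL):
    """Time after which no resolver should still hold the removed record."""
    return removed_at + ttl


def seconds_to_wait(removed_at, now, ttl=DNS_TTL):
    """Remaining wait, in seconds, before the host can be rebooted safely."""
    remaining = earliest_safe_reboot(removed_at, ttl) - now
    return max(0, int(remaining.total_seconds()))


# A host removed from rotation at 14:00 UTC may still be served from
# resolver caches until 15:00 UTC.
removed = datetime(2024, 9, 11, 14, 0, tzinfo=timezone.utc)
print(earliest_safe_reboot(removed).isoformat())  # 2024-09-11T15:00:00+00:00
print(seconds_to_wait(removed, removed + timedelta(minutes=20)))  # 2400
```

Rebooting a set of hosts one at a time under this constraint would therefore take at least one hour per host, unless the TTL is lowered first.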
So we estimate there is no additional impact compared to the current procedures, under normal operating conditions.
Alternatives considered
We've explored the possibility of hooking up mini-nag to Prometheus, so that it takes hosts out of rotation depending on monitored availability.
This has the following problems:
- it requires writing a new check to probe Prometheus (moderately hard) and patching mini-nag to support it (easy)
- it requires patching the Prometheus node exporter to support shutdown metrics (hard, see node exporter issue 3110) or adding our own metrics through the fabric job
- it carries forward a piece of legacy infrastructure, with its own parallel monitoring system and status database, without change
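To give a sense of the first item, a check that probes Prometheus could look something like the sketch below. It uses the Prometheus HTTP API instant query endpoint (`/api/v1/query`) and the standard `up` metric. The server URL, the instance label and the function names are assumptions, and it covers availability only, not the shutdown metric, which does not exist yet.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, instance):
    """Build an instant-query URL for the `up` metric of one instance."""
    query = f'up{{instance="{instance}"}}'
    params = urllib.parse.urlencode({"query": query})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def is_up(payload):
    """Interpret a decoded /api/v1/query response.

    Returns True only if at least one sample came back and all samples
    are 1. A missing sample counts as down, which is a policy choice.
    """
    if payload.get("status") != "success":
        return False
    samples = payload["data"]["result"]
    return bool(samples) and all(float(s["value"][1]) == 1.0 for s in samples)


def check_instance(base_url, instance, timeout=10):
    """Query Prometheus and report whether the instance is considered up."""
    url = build_query_url(base_url, instance)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return is_up(json.load(response))
```

How such a check should behave when Prometheus itself is unreachable (`check_instance` would raise here) is left open, and would need care to avoid taking every mirror out of rotation at once.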
A proper solution would be to rewrite mini-nag with Prometheus in mind, after the node exporter gets support for this metric, to properly monitor the mirror system and adjust DNS accordingly.