internal DNSSEC failures
First diagnostic
Some servers (ssh-dal-01, presumably prom1) are failing to resolve DNS.
Current status
Probably caused by changes related to retiring the tor-nagios-checks package (#41671 - closed), combined with unbound having a pinned key in Puppet.
Fixed on ssh-dal-01, but prometheus1 is likely still down, at least.
Roles
Next steps
- check all hosts for resolution
- test the new key on a couple hosts
- update the key in Puppet (or remove pinning?)
- deploy new keys everywhere by hand (@groente)
- recheck resolution on all hosts (@anarcat, 32 hosts failing)
- redeploy keys
- wait a few minutes
- recheck resolution
- post-mortem
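The "recheck resolution" steps above can be scripted. A minimal sketch; the helper function, hostnames file, and query target are hypothetical, not taken from the actual runbook:

```shell
#!/bin/sh
# Hypothetical helper: classify the status line of a `dig` run as PASS/FAIL.
check_status() {
    case "$1" in
        *"status: NOERROR"*) echo PASS ;;
        *) echo FAIL ;;
    esac
}

# In practice, one might loop over the fleet and query through each host's
# local unbound, e.g. (commented out here since it needs SSH access):
# while read -r host; do
#     out=$(ssh "$host" dig +dnssec torproject.org SOA)
#     printf '%s: %s\n' "$host" "$(check_status "$out")"
# done < hosts.txt
```

Querying a DNSSEC-signed name through the local resolver is what matters here: a plain ping or an `/etc/hosts` lookup would not catch a validation failure.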
Post-mortem
Obscure legacy infrastructure led to an internal DNSSEC outage, distinct from the Saturday 20th DNSSEC outage (#42297 (closed)).
- Affected users: mostly internal staff
- Duration: about two hours, from some time before 2025-09-25 ~15:00UTC to 16:44UTC
- Report Status: finished
Timeline
A GitLab timeline was constructed.
Root cause analysis
Unbound has a mechanism (RFC 5011) to automatically pick up new DNSSEC keys. However, unlike the relatively short TTL of our DS records, this requires a 30-day hold-down period before the new key is actually trusted. If you are unaware of this, do a manual rollover, and remove the old keys before this 30-day period elapses, unbound will no longer trust the signatures from our nameservers.
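The distinction shows up directly in unbound's configuration: a statically pinned anchor (as Puppet was deploying) breaks as soon as the old key disappears from the zone, while an RFC 5011 managed anchor survives a rollover at the cost of the 30-day hold-down. A sketch, with hypothetical file paths:

```
server:
    # Statically pinned key: validation breaks the moment the zone
    # stops signing with this exact key.
    # trust-anchor-file: "/etc/unbound/torproject.org.key"

    # RFC 5011 managed anchor: unbound tracks rollovers itself, but a
    # newly observed key is only trusted after the 30-day hold-down.
    auto-trust-anchor-file: "/var/lib/unbound/torproject.org.key"
```

Note that `auto-trust-anchor-file` requires the file to be writable by unbound, since it records key state (PENDING, VALID, etc.) in it.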
What went well?
- outage was detected quickly and resolved as soon as possible
- @anarcat got to experiment with the enhanced incident response procedures (TPA-RFC-91) (#40421 - closed)
What could have gone better?
- this was our first incident managed with the new incident response framework; as a test of the procedure it seems to work well, but staff still needs to agree on it and receive training
- @groente couldn't reach `dal-rescue-02` because it's on a special port
- the outage triggered a storm of alerts in Prometheus: it seems the `EntireHosterDown` alert didn't entirely fulfill its role:
11:43:56 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://wiki.torproject.org/ is unreachable via HTTPS CRITICAL!
11:43:59 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://karma2.torproject.org/ is unreachable via HTTPS CRITICAL!
11:44:02 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://review.torproject.net/ is unreachable via HTTPS CRITICAL!
11:44:07 -ALERTOR1:#tor-alerts- EntireHosterDown [firing] All probes towards hoster hetzner-fsn1 are failing
11:44:10 -ALERTOR1:#tor-alerts- JobDown [firing] Exporter job "mtail" on srs-dal-01.torproject.org:3903 is down
11:44:36 -ALERTOR1:#tor-alerts- JobDown [firing] Exporter job "minio-bucket" on minio-01.torproject.org:9000 is down
11:45:21 -ALERTOR1:#tor-alerts- JobDown [firing] Exporter job "minio-cluster" on minio-01.torproject.org:9000 is down
11:45:53 -ALERTOR1:#tor-alerts- SystemdFailedUnits [firing] Some systemd units are in failed state on dal-node-03.torproject.org
11:45:53 -ALERTOR1:#tor-alerts- SystemdFailedUnits [firing] Some systemd units are in failed state on metricsdb-01.torproject.org
11:46:25 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://bridges-email.torproject.org/ is unreachable via HTTPS CRITICAL!
11:47:28 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://pages.torproject.net/ is unreachable via HTTPS CRITICAL!
11:47:28 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://rdsys-frontend-01.torproject.org/ is unreachable via HTTPS CRITICAL!
11:47:28 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://tb-build-02.torproject.org/ is unreachable via HTTPS CRITICAL!
11:47:46 <anarchat> i'm going to silence issues
11:47:55 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://dockerhub-mirror.torproject.org/ is unreachable via HTTPS CRITICAL!
11:47:55 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://karma2.torproject.org/ is unreachable via HTTPS CRITICAL!
11:49:06 -ALERTOR1:#tor-alerts- JobDown [firing] Exporter job "mtail" on rdsys-frontend-01.torproject.org:3903 is down
11:49:06 -ALERTOR1:#tor-alerts- JobDown [firing] Exporter job "mtail" on srs-dal-01.torproject.org:3903 is down
11:50:53 -ALERTOR1:#tor-alerts- SystemdFailedUnits [firing] Some systemd units are in failed state on bungei.torproject.org
11:50:53 -ALERTOR1:#tor-alerts- SystemdFailedUnits [firing] Some systemd units are in failed state on dal-node-03.torproject.org
11:50:53 -ALERTOR1:#tor-alerts- SystemdFailedUnits [firing] Some systemd units are in failed state on metricsdb-01.torproject.org
11:52:28 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://pages.torproject.net/ is unreachable via HTTPS CRITICAL!
11:52:28 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://rdsys-frontend-01.torproject.org/ is unreachable via HTTPS CRITICAL!
11:52:28 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://survey.torproject.org/ is unreachable via HTTPS CRITICAL!
11:52:28 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://tb-build-02.torproject.org/ is unreachable via HTTPS CRITICAL!
11:56:25 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://metrics-api.torproject.org/ is unreachable via HTTPS CRITICAL!
11:56:25 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://bridges-email.torproject.org/ is unreachable via HTTPS CRITICAL!
11:57:55 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://test.crm.torproject.org/ is unreachable via HTTPS CRITICAL!
11:57:55 -ALERTOR1:#tor-alerts- HTTPSUnreachable [firing] Website https://dockerhub-mirror.torproject.org/ is unreachable via HTTPS CRITICAL!
11:58:48 -ALERTOR1:#tor-alerts- EntireHosterDown [firing] All probes towards hoster hetzner-fsn1 are failing
11:58:48 -ALERTOR1:#tor-alerts- EntireHosterDown [firing] All probes towards hoster hetzner-nbg1 are failing
11:58:48 -ALERTOR1:#tor-alerts- EntireHosterDown [firing] All probes towards hoster safespring are failing
Notice how the `EntireHosterDown` alert never fired for the quintex point of presence, even though lots of alerts emanated from there? Perhaps the check period and tolerance for `EntireHosterDown` could be lowered to avoid this situation.
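One way to make the hoster-level alert win the race against the per-host noise is to shorten its `for:` hold time. A hypothetical Prometheus rule sketch; the expression, group name, and threshold are illustrative, not our actual rule:

```yaml
groups:
  - name: hoster
    rules:
      - alert: EntireHosterDown
        # Illustrative expression: every blackbox probe for a hoster failing.
        expr: |
          count by (hoster) (probe_success == 0)
            == count by (hoster) (probe_success)
        # Shorter hold time so this fires before the per-host alert storm.
        for: 2m
        annotations:
          summary: "All probes towards hoster {{ $labels.hoster }} are failing"
```

A shorter `for:` trades a bit of flap resistance for earlier detection; pairing it with inhibition rules in Alertmanager would then suppress the individual `JobDown`/`HTTPSUnreachable` alerts once the hoster-level one fires.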
Recommendations and related issues
- document the key rotation procedure and execute an actual rollover to test the procedure, see #42309
- tune the `EntireHosterDown` alert, see #42222 (comment 3263786)
- consider a simpler ICMP check for `HostDown` (see #42313)
- correctly document dal-rescue-02's setup or fix the port forwarding issue, see #42310
- hold off on further DNSSEC changes over the weekend
- consider not having custom trust-anchors in unbound: #42311