DNSSEC outage
First diagnostic
Earlier today, I noticed torproject.org didn't resolve while trying to push to puppet.tpo.
15:14:26 <anarcat> dig torproject.org says: ; EDE: 9 (DNSKEY Missing): (No 
                   DNSKEY matches DS RRs of torproject.org)
15:14:29 <anarcat> ;; SERVER: 8.8.4.4#53(8.8.4.4) (UDP)
15:15:59 <anarcat> i also get failures when i bypass my local DNS in firefox 
                   (with the dns over https stuff)
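The `EDE` line in the `dig` output above is an Extended DNS Error (RFC 8914) reported by the resolver. As a rough sketch, a script could pull that code out of captured `dig` output and flag the DNSSEC-related ones. The `parse_ede` helper and the chosen subset of codes are illustrative assumptions, not part of our tooling.

```python
import re

# Subset of RFC 8914 Extended DNS Error codes that relate to DNSSEC validation.
DNSSEC_EDE_CODES = {
    1: "Unsupported DNSKEY Algorithm",
    2: "Unsupported DS Digest Type",
    5: "DNSSEC Indeterminate",
    6: "DNSSEC Bogus",
    7: "Signature Expired",
    8: "Signature Not Yet Valid",
    9: "DNSKEY Missing",
    10: "RRSIGs Missing",
    11: "No Zone Key Bit Set",
    12: "NSEC Missing",
}

# Matches the "EDE: <code> (<name>)" part of a dig comment line.
EDE_RE = re.compile(r"EDE:\s*(\d+)\s*\(([^)]*)\)")


def parse_ede(dig_output):
    """Return (code, name, is_dnssec_related) for each EDE entry found."""
    results = []
    for match in EDE_RE.finditer(dig_output):
        code = int(match.group(1))
        results.append((code, match.group(2), code in DNSSEC_EDE_CODES))
    return results


sample = "; EDE: 9 (DNSKEY Missing): (No DNSKEY matches DS RRs of torproject.org)"
print(parse_ede(sample))
```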
When did this start?
It's not exactly clear. According to IRC, Prometheus first noticed the issue at 18:17, but that is only when the alert went out; resolution likely started failing before then.
The old DS record:
torproject.org.  IN DS 54250 8 2 a62fcc38294b2beb923450d3b4da37811f6c8296c800a990400cb4e8d7193e63; Pub: 2023-07-03 17:40:10;  Act: 2023-07-03 17:40:10;  Inact: 2025-11-04 17:40:10;  Del: 2025-11-04 17:40:10;  Rev: 2025-09-20 17:40:10
was set to expire at 2025-09-20 17:40:10, 30 minutes before that time.
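The timing fields in that record are easy to misread, so a small script can help by listing them in chronological order. The following is a minimal sketch, assuming the record keeps the `Name: YYYY-MM-DD HH:MM:SS;` layout shown above and that the timestamps are UTC. The `parse_timings` and `upcoming_events` helpers are hypothetical.

```python
import re
from datetime import datetime, timezone

RECORD = (
    "torproject.org.  IN DS 54250 8 2 "
    "a62fcc38294b2beb923450d3b4da37811f6c8296c800a990400cb4e8d7193e63; "
    "Pub: 2023-07-03 17:40:10;  Act: 2023-07-03 17:40:10;  "
    "Inact: 2025-11-04 17:40:10;  Del: 2025-11-04 17:40:10;  "
    "Rev: 2025-09-20 17:40:10"
)

# A field label followed by a full timestamp, e.g. "Rev: 2025-09-20 17:40:10".
TIMING_RE = re.compile(r"(\w+):\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def parse_timings(record):
    """Return {field: datetime} for each timing field (assumed to be UTC)."""
    timings = {}
    for name, stamp in TIMING_RE.findall(record):
        parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        timings[name] = parsed.replace(tzinfo=timezone.utc)
    return timings


def upcoming_events(record, now):
    """List (field, datetime) pairs still in the future, earliest first."""
    events = [(n, t) for n, t in parse_timings(record).items() if t > now]
    return sorted(events, key=lambda item: item[1])


now = datetime(2025, 9, 1, tzinfo=timezone.utc)
for name, when in upcoming_events(RECORD, now):
    print(f"{name}: {when.isoformat()} (in {(when - now).days} days)")
```

Alerting on the earliest upcoming date, regardless of its label, would sidestep having to interpret which field actually triggers the rotation.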
Current status
The DS records were updated at joker.com at 2025-09-20 19:32 UTC, and service is recovering progressively, depending on the caches (TTL=1h).
Roles
- Command: @anarcat
Next steps
- post-mortem
Post-mortem
Executive summary
A spurious DNSSEC key rotation led to a complete outage of all domain name resolution services, which rendered all of our services unavailable for a few hours on September 20th 2025. Key rotations will be retired and monitoring of the DNS infrastructure will be improved in response.
- Affected users: all users of DNSSEC-validating resolvers, including the Google (8.8.8.8) and Cloudflare (1.1.1.1) resolvers
- Duration:
- for torproject.org: about 4 hours, from 2025-09-20 17:40 UTC to 21:30 UTC
- for .net and .com: about 24 hours, starting about the same time
 
- Status page link: https://status.torproject.org/issues/2025-09-20-dnssec-outage/
- Report Status: finished
Outage Description
The DS record in our zone was updated automatically by in-house scripts but never propagated to our registrar, which meant we were serving zones signed with a new key that had no matching DS record in the parent (TLD) zone.
That, in turn, caused all validating DNS queries to fail and led to a global outage for users relying on DNSSEC-enforcing resolvers.
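To illustrate the mismatch, here is a minimal sketch of how a DS record in the parent zone is tied to a DNSKEY in the child zone, following the key tag and digest computations of RFC 4034 (with the SHA-256 digest type from RFC 4509). The key material below is a made-up placeholder, not our real key, and the function names are hypothetical.

```python
import hashlib
import struct


def name_to_wire(name):
    """Encode a domain name in canonical (lowercase) DNS wire format."""
    labels = [label for label in name.lower().split(".") if label]
    wire = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels)
    return wire + b"\x00"


def dnskey_rdata(flags, protocol, algorithm, public_key):
    """Build DNSKEY RDATA: flags (2 bytes), protocol, algorithm, then the key."""
    return struct.pack("!HBB", flags, protocol, algorithm) + public_key


def key_tag(rdata):
    """Key tag as defined in RFC 4034, Appendix B."""
    acc = 0
    for i, byte in enumerate(rdata):
        acc += byte << 8 if i % 2 == 0 else byte
    acc += (acc >> 16) & 0xFFFF
    return acc & 0xFFFF


def ds_digest_sha256(owner, rdata):
    """DS digest type 2: SHA-256 over the owner name plus the DNSKEY RDATA."""
    return hashlib.sha256(name_to_wire(owner) + rdata).hexdigest()


def ds_matches(owner, rdata, ds_tag, ds_digest):
    """True if a DS (key tag, digest) pair in the parent refers to this DNSKEY."""
    same_tag = key_tag(rdata) == ds_tag
    return same_tag and ds_digest_sha256(owner, rdata) == ds_digest.lower()


# Placeholder keys: flags 257 (KSK), protocol 3, algorithm 8 (RSA/SHA-256).
old_key = dnskey_rdata(257, 3, 8, b"old-placeholder-key")
new_key = dnskey_rdata(257, 3, 8, b"new-placeholder-key")

# The parent still publishes a DS for the old key only.
parent_ds = (key_tag(old_key), ds_digest_sha256("torproject.org.", old_key))

print(ds_matches("torproject.org.", old_key, *parent_ds))  # old key: chain intact
print(ds_matches("torproject.org.", new_key, *parent_ds))  # new key: no matching DS
```

When no DNSKEY in the zone matches any DS in the parent, a validating resolver cannot build the chain of trust, which is consistent with the "No DNSKEY matches DS RRs" error seen during the first diagnostic.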
Timeline
See the GitLab timeline.
Root cause analysis
- Problem 1: automated key rotations require manual work
- the key rotation wasn't fixed in our registrar in a timely manner, because of problem 2
 
- Problem 2: deficient DNSSEC monitoring
- normally, key rotations are propagated upstream after someone notices a Nagios alert
- those alerts were not ported to Prometheus (#41794 (closed))
- we knew about the problem, but misunderstood the dsset fields, interpreting the Rev ("revocation"?) date as the target date, while the records were rotated on the Inactive date, more than a month earlier
 
What went well?
- this happened over a weekend, which wasn't too disruptive for workers
- the issue was detected by @anarcat quickly
Recommendations
- disable automatic key rotation (see #42268 (closed) for followup)
- improve DNSSEC monitoring (#41794 (closed))
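As one possible shape for the improved monitoring, the sketch below renders the time remaining before each key timing event in the Prometheus text exposition format, which an alert rule could then compare against a lead time. The metric name `dnssec_key_event_seconds_remaining`, its labels, and the `render_metrics` helper are all hypothetical; nothing here reflects the actual checks tracked in #41794.

```python
from datetime import datetime, timezone

METRIC = "dnssec_key_event_seconds_remaining"  # hypothetical metric name


def render_metrics(zone, events, now):
    """Render one gauge sample per timing event in Prometheus text format."""
    lines = [
        f"# HELP {METRIC} Seconds until a DNSSEC key timing event.",
        f"# TYPE {METRIC} gauge",
    ]
    # Earliest event first, so the most urgent sample is at the top.
    for event, when in sorted(events.items(), key=lambda item: item[1]):
        remaining = int((when - now).total_seconds())
        lines.append(f'{METRIC}{{zone="{zone}",event="{event}"}} {remaining}')
    return "\n".join(lines) + "\n"


# Timing values taken from the old DS record quoted in this report.
events = {
    "Inact": datetime(2025, 11, 4, 17, 40, 10, tzinfo=timezone.utc),
    "Rev": datetime(2025, 9, 20, 17, 40, 10, tzinfo=timezone.utc),
}
now = datetime(2025, 9, 20, 16, 40, 10, tzinfo=timezone.utc)
print(render_metrics("torproject.org", events, now), end="")
```

Exporting every timing field, rather than a single "expiry" value, leaves the choice of which date matters to the alert rule, where it can be corrected without touching the exporter.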