DNSSEC outage
First diagnostic
Earlier today, I noticed that torproject.org did not resolve while I was trying to push to puppet.tpo.
15:14:26 <anarcat> dig torproject.org says: ; EDE: 9 (DNSKEY Missing): (No
DNSKEY matches DS RRs of torproject.org)
15:14:29 <anarcat> ;; SERVER: 8.8.4.4#53(8.8.4.4) (UDP)
15:15:59 <anarcat> i also get failures when i bypass my local DNS in firefox
(with the dns over https stuff)
When did this start?
It's not exactly clear. According to IRC, Prometheus first flagged the issue at 18:17, but that is only when the alert went out; resolution likely started failing before then.
The old DS record:
torproject.org. IN DS 54250 8 2 a62fcc38294b2beb923450d3b4da37811f6c8296c800a990400cb4e8d7193e63; Pub: 2023-07-03 17:40:10; Act: 2023-07-03 17:40:10; Inact: 2025-11-04 17:40:10; Del: 2025-11-04 17:40:10; Rev: 2025-09-20 17:40:10
was set to expire at 2025-09-20 17:40:10, about 30 minutes before that alert.
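The timing fields in the trailing comment of that record can be read mechanically instead of by eye. The following is a minimal sketch, assuming the `Name: YYYY-MM-DD HH:MM:SS` layout shown above and assuming the timestamps are UTC; the names `parse_timings` and `next_event` are hypothetical helpers, not part of our tooling.

```python
import re
from datetime import datetime, timezone

# The DS record quoted above, including its trailing timing comment.
DS_LINE = (
    "torproject.org. IN DS 54250 8 2 "
    "a62fcc38294b2beb923450d3b4da37811f6c8296c800a990400cb4e8d7193e63; "
    "Pub: 2023-07-03 17:40:10; Act: 2023-07-03 17:40:10; "
    "Inact: 2025-11-04 17:40:10; Del: 2025-11-04 17:40:10; "
    "Rev: 2025-09-20 17:40:10"
)

# Matches pairs such as "Rev: 2025-09-20 17:40:10".
TIMING_RE = re.compile(r"(\w+): (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def parse_timings(line):
    """Return {field: datetime} for every timing pair (assumption: UTC)."""
    return {
        name: datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(
            tzinfo=timezone.utc
        )
        for name, stamp in TIMING_RE.findall(line)
    }


def next_event(line, now):
    """Return (field, when) for the earliest timing strictly after `now`.

    Returns None when every timing field is already in the past.
    """
    upcoming = [
        (when, name) for name, when in parse_timings(line).items() if when > now
    ]
    if not upcoming:
        return None
    when, name = min(upcoming)
    return name, when
```

Called with a `now` of 2025-09-01 UTC, `next_event(DS_LINE, now)` returns the `Rev` field at 2025-09-20 17:40:10, which coincides with the reported start of the outage. Taking the earliest upcoming date, whatever its label, avoids having to guess which field drives the rotation.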
Current status
The DS keys were updated at joker.com at 2025-09-20 19:32 UTC, and service is recovering progressively as resolver caches expire (TTL=1h).
Roles
- Command: @anarcat
Next steps
- post-mortem
Post-mortem
Executive summary
A spurious DNSSEC key rotation led to a complete outage of all domain name resolution services, which rendered all of our services unavailable for a few hours on September 20th, 2025. Key rotations will be retired and monitoring of the DNS infrastructure will be improved in response.
- Affected users: all users of DNSSEC-validating resolvers, including Google (8.8.8.8) and Cloudflare (1.1.1.1)
- Duration:
  - for torproject.org: about 4 hours, from 2025-09-20 17:40 UTC to 21:30 UTC
  - for .net and .com: about 24 hours, starting at about the same time
- Status page link: https://status.torproject.org/issues/2025-09-20-dnssec-outage/
- Report Status: finished
Outage Description
The DS record in our zone was updated automatically by in-house scripts but never propagated to our registrar, which meant we were serving zones with a new key that wasn't in the parent zone's nameservers.
That, in turn, invalidated all verifying DNS queries and led to a global outage for users relying on DNSSEC-enforcing resolvers.
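The mismatch described above can be checked offline by recomputing a DS digest from each DNSKEY and comparing it with the DS record published upstream. The following is a minimal sketch using only the standard library, following the key tag and DS digest definitions of RFC 4034 for digest type 2 (SHA-256); the function names and the tuple layout for keys are assumptions made for illustration, not our actual scripts.

```python
import base64
import hashlib
import struct


def name_to_wire(name):
    """Encode a domain name in canonical (lowercase) DNS wire format."""
    labels = [label for label in name.lower().rstrip(".").split(".") if label]
    return b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in labels
    ) + b"\x00"


def dnskey_rdata(flags, protocol, algorithm, public_key_b64):
    """Build DNSKEY RDATA: flags (2 bytes), protocol, algorithm, public key."""
    return struct.pack("!HBB", flags, protocol, algorithm) + base64.b64decode(
        public_key_b64
    )


def key_tag(rdata):
    """Key tag per RFC 4034 Appendix B (not valid for algorithm 1)."""
    acc = 0
    for index, byte in enumerate(rdata):
        # Even offsets are the high byte of a 16-bit word, odd offsets the low.
        acc += byte if index & 1 else byte << 8
    acc += (acc >> 16) & 0xFFFF
    return acc & 0xFFFF


def ds_sha256(owner, rdata):
    """DS digest type 2: SHA-256 over owner name wire form plus DNSKEY RDATA."""
    return hashlib.sha256(name_to_wire(owner) + rdata).hexdigest()


def ds_matches_any_dnskey(owner, ds_tag, ds_digest, dnskeys):
    """Return True if any (flags, protocol, algorithm, key_b64) matches the DS."""
    for flags, protocol, algorithm, public_key_b64 in dnskeys:
        rdata = dnskey_rdata(flags, protocol, algorithm, public_key_b64)
        if key_tag(rdata) == ds_tag and ds_sha256(owner, rdata) == ds_digest.lower():
            return True
    return False
```

For the record in this incident, `ds_tag` would be 54250 and `ds_digest` the hex string from the old DS record, while `dnskeys` would hold the keys currently served by the zone. A `False` result corresponds to the "No DNSKEY matches DS RRs" error reported by the resolver.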
Timeline
See the GitLab timeline.
Root cause analysis
- Problem 1: automated key rotations require manual work
  - the key rotation wasn't fixed at our registrar in a timely manner, because of problem 2
- Problem 2: deficient DNSSEC monitoring
  - normally, key rotations are propagated upstream after someone notices a Nagios alert
  - those alerts were not ported to Prometheus (#41794 (closed))
  - we knew about the problem, but misunderstood the dsset fields, interpreting the Rev ("revocation"?) date as the target date, while the records were rotated on the Inactive date, more than a month earlier
What went well?
- this happened over a weekend, so it wasn't too disruptive for workers
- the issue was detected by @anarcat quickly
Recommendations
- disable automatic key rotation (see #42268 (closed) for followup)
- improve DNSSEC monitoring (#41794 (closed))