DNSSEC outage

First diagnostic

Earlier today, I noticed that torproject.org didn't resolve while trying to push to puppet.tpo.

15:14:26 <anarcat> dig torproject.org says: ; EDE: 9 (DNSKEY Missing): (No 
                   DNSKEY matches DS RRs of torproject.org)
15:14:29 <anarcat> ;; SERVER: 8.8.4.4#53(8.8.4.4) (UDP)
15:15:59 <anarcat> i also get failures when i bypass my local DNS in firefox 
                   (with the dns over https stuff)

When did this start?

It's not exactly clear. According to IRC, Prometheus first noticed the issue at 18:17, but that is only when the alert went out; resolution likely started failing before then.

The old DS record:

torproject.org.  IN DS 54250 8 2 a62fcc38294b2beb923450d3b4da37811f6c8296c800a990400cb4e8d7193e63; Pub: 2023-07-03 17:40:10;  Act: 2023-07-03 17:40:10;  Inact: 2025-11-04 17:40:10;  Del: 2025-11-04 17:40:10;  Rev: 2025-09-20 17:40:10

was set to expire at 2025-09-20 17:40:10, about half an hour before that alert.
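
The timing fields in that record can be pulled apart mechanically, which makes it easier to see which date comes first. The following is a minimal sketch, assuming the `Name: value;` comment layout shown above and assuming the timestamps are UTC; `parse_ds_timing` is a hypothetical helper, not part of our tooling.

```python
from datetime import datetime, timezone

# Record copied verbatim from this report.
DS_LINE = (
    "torproject.org.  IN DS 54250 8 2 "
    "a62fcc38294b2beb923450d3b4da37811f6c8296c800a990400cb4e8d7193e63; "
    "Pub: 2023-07-03 17:40:10;  Act: 2023-07-03 17:40:10;  "
    "Inact: 2025-11-04 17:40:10;  Del: 2025-11-04 17:40:10;  "
    "Rev: 2025-09-20 17:40:10"
)


def parse_ds_timing(line):
    """Split a DS line into the record itself and its {field: datetime} comments."""
    record, _, comment = line.partition(";")
    timing = {}
    for chunk in comment.split(";"):
        name, sep, value = chunk.strip().partition(": ")
        if sep:
            # Assumption: the timestamps are in UTC.
            parsed = datetime.strptime(value.strip(), "%Y-%m-%d %H:%M:%S")
            timing[name] = parsed.replace(tzinfo=timezone.utc)
    return record.strip(), timing


record, timing = parse_ds_timing(DS_LINE)
# Print the events in chronological order.
for name, when in sorted(timing.items(), key=lambda item: item[1]):
    print(f"{name:6} {when.isoformat()}")
```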


Current status

The DS records were updated at joker.com at 2025-09-20 19:32 UTC, and service is recovering progressively as caches expire (TTL=1h).

Roles

  • Command: @anarcat

Next steps

  • post-mortem

Post-mortem

Executive summary

A spurious DNSSEC key rotation led to a complete outage of all domain name resolution services, which rendered all of our services unavailable for a few hours on September 20th, 2025. Key rotations will be retired and monitoring of the DNS infrastructure will be improved in response.

  • Affected users: all users of DNSSEC-validating resolvers, including the Google (8.8.8.8) and Cloudflare (1.1.1.1) resolvers
  • Duration:
    • for torproject.org: about 4 hours, from 2025-09-20 17:40 UTC to 21:30 UTC
    • for .net and .com: about 24 hours, starting about the same time
  • Status page link: https://status.torproject.org/issues/2025-09-20-dnssec-outage/
  • Report Status: finished

Outage Description

The DS record for our zone was updated automatically by in-house scripts but never propagated to our registrar, which meant we were serving zones signed with a new key that had no matching DS record in the parent zone.

That, in turn, made every validating DNS query fail and led to a global outage for users relying on DNSSEC-enforcing resolvers.
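
To make the failure mode concrete: a validating resolver accepts a DNSKEY only if its key tag and digest match a DS record published by the parent. The sketch below reproduces that comparison with the Python standard library, following the key tag computation in RFC 4034 (Appendix B) and the SHA-256 digest of RFC 4509 (digest type 2, as in our DS record). The key material is a placeholder and the helper names are hypothetical; this is an illustration, not the check a resolver or our tooling actually runs.

```python
import base64
import hashlib
import struct


def name_to_wire(name):
    """Encode a domain name in canonical (lowercase) DNS wire format."""
    labels = [label for label in name.lower().rstrip(".").split(".") if label]
    return b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"


def dnskey_rdata(flags, protocol, algorithm, public_key_b64):
    """Build DNSKEY RDATA: flags (2 bytes), protocol, algorithm, key material."""
    header = struct.pack("!HBB", flags, protocol, algorithm)
    return header + base64.b64decode(public_key_b64)


def key_tag(rdata):
    """Key tag computed over the DNSKEY RDATA (RFC 4034, Appendix B)."""
    acc = 0
    for i, byte in enumerate(rdata):
        acc += (byte << 8) if i % 2 == 0 else byte
    acc += (acc >> 16) & 0xFFFF
    return acc & 0xFFFF


def ds_digest_sha256(owner, rdata):
    """DS digest type 2: SHA-256 over the owner name plus the DNSKEY RDATA."""
    return hashlib.sha256(name_to_wire(owner) + rdata).hexdigest()


def ds_matches(owner, rdata, ds_key_tag, ds_digest_hex):
    """True if this DNSKEY is the one the parent's DS record points at."""
    return (key_tag(rdata) == ds_key_tag
            and ds_digest_sha256(owner, rdata) == ds_digest_hex.lower())


# Placeholder key material, NOT real torproject.org keys.
old_key = dnskey_rdata(257, 3, 8, base64.b64encode(b"old-key").decode())
new_key = dnskey_rdata(257, 3, 8, base64.b64encode(b"new-key").decode())

# The parent still publishes a DS derived from the old key...
parent_ds = (key_tag(old_key), ds_digest_sha256("torproject.org.", old_key))

print(ds_matches("torproject.org.", old_key, *parent_ds))  # True
# ...so a zone serving only the new key cannot be validated.
print(ds_matches("torproject.org.", new_key, *parent_ds))  # False
```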

Timeline

See the GitLab timeline.

Root cause analysis

  • Problem 1: automated key rotations require manual work
    • the key rotation wasn't fixed in our registrar in a timely manner, because of problem 2
  • Problem 2: deficient DNSSEC monitoring
    • normally, key rotations are propagated upstream after someone notices a Nagios alert
    • those alerts were not ported to Prometheus (#41794 (closed))
    • we knew about the problem, but misunderstood the dsset fields, interpreting the Rev ("revocation"?) date as the target date, while the records were rotated on the Inactive date, more than a month earlier
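
One way to avoid depending on the interpretation of a single field is to alert on the earliest upcoming date among all the timing fields. The following is a minimal sketch, assuming the dates have already been parsed into timezone-aware values; `check_rollover` and its thresholds are hypothetical, and this is not the Prometheus check tracked in #41794.

```python
from datetime import datetime, timedelta, timezone


def check_rollover(timing, now, warn=timedelta(days=30), crit=timedelta(days=7)):
    """Return (status, field, remaining) for the nearest future timing event."""
    upcoming = {name: when for name, when in timing.items() if when > now}
    if not upcoming:
        # Nothing scheduled; a real check would also compare against the parent's DS.
        return "unknown", None, None
    field = min(upcoming, key=upcoming.get)
    remaining = upcoming[field] - now
    if remaining <= crit:
        return "critical", field, remaining
    if remaining <= warn:
        return "warning", field, remaining
    return "ok", field, remaining


def utc(*args):
    """Shorthand for a UTC datetime."""
    return datetime(*args, tzinfo=timezone.utc)


# Dates taken from the old DS record quoted in this report.
timing = {
    "Pub": utc(2023, 7, 3, 17, 40, 10),
    "Act": utc(2023, 7, 3, 17, 40, 10),
    "Inact": utc(2025, 11, 4, 17, 40, 10),
    "Del": utc(2025, 11, 4, 17, 40, 10),
    "Rev": utc(2025, 9, 20, 17, 40, 10),
}

status, field, remaining = check_rollover(timing, now=utc(2025, 9, 1))
print(status, field, remaining)
```

Because the check looks at every field, it does not matter which label the rotation scripts act on: whichever date comes first triggers the alert.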

What went well?

  • this happened over a weekend, which wasn't too disruptive for workers
  • the issue was detected by @anarcat quickly

Recommendations

  • disable automatic key rotation (see #42268 (closed) for followup)
  • improve DNSSEC monitoring (#41794 (closed))
Edited Sep 24, 2025 by anarcat