Relay daemon ceases to service Tor Browser requests, timing out, when a local instance of 'unbound' is the DNS resolver and large numbers of DNS requests time-out.
Works fine when 'named' is swapped in place of 'unbound'.
GoDaddy DNS stops responding when large numbers of queries are submitted and this was observed as the particular trigger.
To reproduce, configure the SOA+NS records for several thousand dummy domains to point to a non-responding IP, then generate large numbers of requests against them.
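A rough sketch of the load-generation step, assuming the dummy zones have already been delegated to a non-responding IP; the dummy-N-M.example names and the counts are placeholders, not the domains actually used:

  #!/bin/sh
  # Hammer the local resolver with lookups for dummy domains whose NS
  # records point at a non-responding IP, filling its request list.
  # Runs 50 queries at a time to avoid forking thousands of dig processes.
  for batch in $(seq 1 100); do
      for i in $(seq 1 50); do
          dig @127.0.0.1 +tries=1 +time=5 "dummy-$batch-$i.example" A \
              >/dev/null 2>&1 &
      done
      wait
  done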
I understand that what triggers this is GoDaddy's policy of blocking IP addresses that make high volumes of DNS requests. That will happen regardless of which application is making the DNS requests (unbound, named, etc.), so why exactly is this unbound's fault?
If it's something in unbound and only in unbound, we should patch it upstream since it's not Tor related. If it's something in the Tor daemon that interacts with unbound in such a broken way, we should patch that, but can you please explain more?
Affects tor running against Unbound. The GoDaddy block remains in effect even now, after putting named up. An exit ranked about 60 works perfectly with named and becomes absolutely unusable with Unbound in place. To my eyes that is a major-severity bug (though perhaps not high priority). Fairly obvious the problem is the tor daemon's handling of some quirk of timed-out request failures done the Unbound way. My guess is nothing is wrong with Unbound.
Am the second operator to experience this problem and fix it with named.
Have a debug-level trace of about 3-5 minutes of the relay-unresponsive state, with known test queries being made via TBB. Too big to post here; will upload to another service if someone asks.
Trac: Status: needs_information to new
Severity: Normal to Major
One obvious difference is that named times out requests in ten seconds and does not fall back to TCP (with DNSSEC disabled), at least when querying via dig.
Unbound tries for 120 seconds, falling back to TCP. I researched Unbound's time-outs and did not find a way to modify this behavior.
So to summarize, it sounds like unbound's behavior when doing a dns resolve is more aggressive than named's behavior? And Godaddy has some sort of abuse detection mechanism that makes it refuse to answer dns questions from loud IP addresses? And whatever unbound is doing is more often triggering godaddy's mechanism?
And while some people on tor-relays thought that this was maybe a Tor bug, it can't be a Tor bug if the issue is "the dns server you're asking questions to won't answer"? Or is there still a Tor bug here too, where Tor should handle it better when it doesn't get any dns answer?
So to summarize, it sounds like unbound's behavior when doing a dns resolve is more aggressive than named's behavior?
It appears that Unbound is more persistent than named, but it employs a sophisticated exponential back-off scheme, so I'm not sure it would be considered more aggressive. The documentation link above goes into the Unbound time-out scheme at great length. Named appears to have a much simpler and shorter retry/timeout approach.
And Godaddy has some sort of abuse detection mechanism that makes it refuse to answer dns questions from loud IP addresses?
In 2011 GoDaddy implemented a policy of blocking high-volume DNS requesters in order to avoid adding resources to their DNS server pool. At one point this apparently included blocking GoogleBot. Appears to be a manually maintained list with an arbitrary selection policy. See
It appears that my Dhalgren relay was added to their block list three days ago and the 'ashtrayhat3' relay was added back in January. My relay continues to have DNS blocked by GoDaddy. Probably several other fast relays are blocked, but they never ran with unbound and so the blocking was not noticed.
And whatever unbound is doing is more often triggering godaddy's mechanism?
I doubt it's unbound (vs named) that caused GoDaddy to block DNS from my exit. They block high-volume DNS requesters in general. I also noticed that ed.gov is blocking my relay.
And while some people on tor-relays thought that this was maybe a Tor bug, it can't be a Tor bug if the issue is "the dns server you're asking questions to won't answer"? Or is there still a Tor bug here too, where Tor should handle it better when it doesn't get any dns answer?
I'm 80-90% sure it's a bug in the way the Tor daemon interacts with unbound's behavior w/r/t large numbers of timing-out DNS queries. Unbound appears to be perfectly fine with the situation when it occurs. Tor daemon DNS queries lock up wholesale, thus preventing normal exit browsing behavior. The Tor daemon is fine with the GoDaddy DNS block when named is the intermediary--large numbers of request time-outs of GoDaddy domains continue unabated.
Data-transfer via circuits appears unaffected as the relay earned a 100% rating increase from the BWauths while it was in the broken state (running 20% of normal traffic load) for 37 hours.
Here's a small random selection of GoDaddy domains that were timing out during the incident. Useful for checking whether a GoDaddy DNS block is in effect for a particular resolver. They all have NS records ending with .domaincontrol.com.
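A quick way to run that check from the relay, sketched with example.com standing in for one of the affected domains:

  # Confirm the domain is served by GoDaddy DNS (expect *.domaincontrol.com).
  dig +short NS example.com
  # Query it through the local resolver; a block shows up as a timeout
  # rather than an answer or a SERVFAIL.
  time dig @127.0.0.1 +tries=1 +time=20 example.com A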
Is academic, but it looks like GoDaddy is blocking all traffic to their AS26496 network from my relay. Tested connecting directly to the raw IPs of some GoDaddy-hosted web sites and also see nothing but timeouts. I was considering trying to configure named to pull NS domaincontrol.com domains from Google DNS, but clearly there's no point.
Spent some time accessing relays with control-channel commands. GoDaddy does not appear to be systematically blocking fast Tor relays. However the block stays once applied. 'ashtrayhat3' can resolve GoDaddy DNS queries because the operator pointed DNS at a regional resolver. But 'ashtrayhat3' still cannot connect to addresses in AS26496.
Judging by the 30000 GoDaddy domain lookup attempts found in the unbound cache on my relay, I suspect that 'ashtrayhat3' and 'Dhalgren' were selected by some abuser as preferred nodes in Wordpress login attacks against large numbers of small blog-sites hosted by GoDaddy. GoDaddy decided to null-route the relays because of the volume of abuse.
Tor relays should be able to withstand such blocking, but at present a relay falls apart when unbound is the resolver.
If a fix for this issue is created, I will probably be able to test it for some months going forward due to the persistence of the null-routing.
Debug-level SafeLogging=1 trace taken while the relay was in the DNS-frozen state. During the trace a failed attempt was made to access the news site known for having the most efficient page design; it took one minute fifteen seconds. Also have a SafeLogging=0 trace if it turns out that would make a difference in analyzing the issue.
After some thought I realized that the dramatically different timeout behavior of Unbound relative to named might be an issue in other use cases. Posted to the unbound-users list about it and they responded with an insightful analysis. It seems possible that an Unbound compatibility feature / option might be created for applications that, like Tor, employ the eventdns component of libevent.
named /etc/resolv.conf
options timeout:5 attempts:1 max-inflight:4096 max-timeouts:100
Possibly, even probably, max-timeouts:1000000 is better, but I'd like to hear from a Tor developer on whether completely inhibiting the "down resolver" state in eventdns (even when the resolver is in fact down) is a good idea or not.
Unbound /etc/resolv.conf
options timeout:5 attempts:1 max-inflight:4096 max-timeouts:100
named /etc/resolv.conf
options timeout:5 attempts:1 max-inflight:4096 max-timeouts:100
Uhh, those 2 are the same. Just saying...
It turns out that max-timeouts is capped at 255 by eventdns.c. Will create a patch to remove the 255 limit on the next Tor daemon update. The only purpose of the "down resolver" state is to shift load to a different resolver, but in this situation that's undesirable. Have exactly one local resolver, and if it fails an alarm goes off for manual attention.
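For reference, a complete /etc/resolv.conf along these lines for the single-local-resolver setup, using the option values discussed above (adjust max-timeouts once the 255 cap is removed):

  nameserver 127.0.0.1
  options timeout:5 attempts:1 max-inflight:4096 max-timeouts:100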
I might also create an alarm that triggers when
unbound-control dump_requestlist
grows to more than 200 pending requests, since that's what was observed during the relay failure. Shouldn't fail now, but it will be interesting to verify that and to examine the next potential DNS DoS situation.
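A minimal cron-able sketch of such an alarm, assuming unbound-control is already set up for this instance; counting output lines of dump_requestlist is only an approximation of the pending-request count, so the threshold may need adjusting:

  #!/bin/sh
  # Alert when unbound's pending-request list grows beyond a threshold.
  THRESHOLD=200
  # dump_requestlist prints roughly one line per outstanding query (plus
  # headers); the line count is a rough but serviceable backlog measure.
  PENDING=$(unbound-control dump_requestlist | wc -l)
  if [ "$PENDING" -gt "$THRESHOLD" ]; then
      logger -p daemon.warning \
          "unbound request list has $PENDING entries (threshold $THRESHOLD)"
  fi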
With Unbound running I did positively confirm that
time dig @127.0.0.1 +tries=1 +time=20
runs for 20 seconds and quits with no response from the resolver, whereas with 'named' the same command shows the resolver sending back SERVFAIL at 10 seconds. In the above scenario 'unbound-control dump_requestlist' shows the requested domain for the full 120 seconds that Unbound attempts to resolve it.
A downside to max-inflight:16384 is potential performance degradation of the primary Tor-process event loop due to linear queue searches for completed requests. The man(3) page for evdns states
Several algorithms require a full walk of the inflight queue and so bounding its size keeps things going nicely under huge (many thousands of requests) loads.
My perspective is that the in-flight queue would only grow large under potential DoS scenarios such as the one described in this ticket, and paying the performance cost of linear list searches is acceptable (on modern hardware) if it prevents exit relays from becoming unusable. An unlimited-size, red-black-tree-indexed work queue would be ideal, but would require significant development effort to implement.
A mitigating factor is that when the in-flight queue grows large it will predominantly consist of entries that will never receive a response and will time out in the order they appear on the queue. On the other hand, a majority of completed requests will appear at the very end of the list, suggesting a doubly-linked structure and searching for completions from the end rather than the beginning when the in-flight count exceeds a threshold.
After coming to understand the issue and tuning for it, I created an alarm that triggers whenever the output from the 'unbound-control dump_requestlist' command exceeds 200 entries (400 actually, as two DNS worker threads are configured and the command shows only thread 0). This has been hit twice now, and today I managed to look at it while the event was still in progress.
This time instead of GoDaddy, the DNS enumeration target was a major Australian DNS provider and they, like GoDaddy, blocked DNS etc. traffic from the exit node.
The tuning appears to work, as the node continued to be usable with good performance when "setconf ExitNodes=" is applied for testing on a client. I observe a 15% or so drop-off in traffic per the ISP utilization graph at the time of the alarm, with full recovery occurring in about one hour. Hard to say whether the drop-off was the result of a performance impact from the DNS abuse or whether it's because the attacker gave up using the node just after I started looking--dump_requestlist began shrinking rapidly then.
In any case the node did vastly better than at the time of the original attack, before the eventdns tuning was applied. At that time the relay was effectively taken down for two days. That attacker left their abuse program running and didn't notice that GoDaddy had put an end to the scan.
A look at the utilization graph for the earlier alarm incident showed no performance impact, but that one occurred during off-peak hours when utilization runs significantly lower.
Sebastian points out that we are now experiencing this bug on many large Tor exit relays, in #21394 (moved).
So, ten points to Dhalgren for identifying and debugging it early. :)
Also, am I reading the above correctly, that evdns does not scale well? If so, that is a thing that we should be able to fix on the Tor and/or libevent side.
Thank you. Points gratefully accepted--is a pleasure when an extensive effort like this one proves valuable.
Unfortunate it took a while for this ticket to connect with #21394 (moved), a ticket I was unaware of, though the problem of connection timeouts via top-tier relays has irritated me for months. It didn't cross my mind that the cause might be one and the same, since one cannot trivially determine the resolver employed by an exit, and I believed others would discover this ticket and the documentation I added and correct for it. It is so severe I frequently consider adding the top 50-100 exits to ExcludeExitNodes.
Short term the recommended tuning is well worth the cost, but I reviewed the code and the performance burden of walking a request list with thousands of timing-out DNS queries is probably worth correcting. A red-black tree is of course the most versatile and resilient solution, but I observe that support for doubly-linked lists was added to the daemon core, and implementing one as mentioned in comment:17 above addresses this case and may be expedient.
I placed the fifty-seven inoperable exits from the spreadsheet in #21394 (moved) on ExcludeExitNodes; the result was a dramatic improvement in browsing experience. Of course browsing experience will improve further once this full 30% of stranded exit consensus weight becomes usable.
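For anyone wanting to do the same, the torrc syntax is a comma-separated list of relay fingerprints; the fingerprints below are placeholders rather than the actual fifty-seven exits:

  # torrc excerpt: never build circuits exiting through the listed relays.
  # $AAAA.../$BBBB... are placeholder fingerprints, not the real list.
  ExcludeExitNodes $AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,$BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB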
These tickets were marked as removed, and nobody has said that they can fix them. Let's remember to look at 033-removed-20180320 as we re-evaluate our triage process, to see whether we're triaging out unnecessarily, and to evaluate whether we're deferring anything unnecessarily. But for now, we can't do these: we need to fix the 033-must stuff now.
Trac: Milestone: Tor: 0.3.3.x-final to Tor: unspecified