The Exit DNS Timeout Problem
Since we added the DNS timeout overload line to relays, it has been showing up on the majority of Exits. With the current parameters in 0.4.7.2-alpha, the line is triggered if 1% of all DNS queries time out over a 10-minute period.
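To make the rule concrete, here is a minimal sketch of the check in Python. It is not tor's actual code: the class and method names are hypothetical, and it assumes a sliding window and a "greater than or equal" comparison, neither of which is stated above.

```python
from collections import deque


class DnsTimeoutWindow:
    """Track DNS query results over a time window (illustrative only)."""

    def __init__(self, period_secs=600, threshold=0.01):
        self.period_secs = period_secs  # 10 minutes
        self.threshold = threshold      # 1% of all queries
        self.events = deque()           # (timestamp_secs, timed_out) pairs

    def record(self, now, timed_out):
        """Record one DNS query result observed at time `now` (seconds)."""
        self.events.append((now, timed_out))
        self._expire(now)

    def _expire(self, now):
        # Drop results that have fallen out of the observation window.
        while self.events and self.events[0][0] <= now - self.period_secs:
            self.events.popleft()

    def timeout_ratio(self, now):
        """Fraction of queries in the window that timed out."""
        self._expire(now)
        if not self.events:
            return 0.0
        timeouts = sum(1 for _, timed_out in self.events if timed_out)
        return timeouts / len(self.events)

    def is_overloaded(self, now):
        # Assumption: the line triggers at or above the threshold.
        return self.timeout_ratio(now) >= self.threshold
```

With this rule, a single timeout among 100 queries in the window is enough to trigger the line.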
Looking at the top 10 list of Exits, almost half of them are overloaded likely due to DNS timeouts: https://metrics.torproject.org/rs.html#search/flag:exit%20
Two Exit relay operators contacted us about this problem and helped chase it down. I'll go into detail on one operator's story and add a note about the second.
AndersTrier
This operator is a well-known Exit operator based in .dk and has a large set of Exits. Last week, he showed up with almost 5% DNS timeouts reported by his Exits.
The setup here is that the tor exit node sends its DNS queries to a local Unbound server, so we were able to get a lot of information from Unbound.
The average resolving time was around 9.8 seconds, but the median was 0.07 seconds. In other words, at least 50% of the queries resolved normally, well below a second, but it appears that about 5% took so long that they pulled the average up to almost 10 seconds.
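The gap between mean and median can be reproduced with a synthetic sample. The numbers below are not Anders' measurements: the sketch assumes exactly 5% slow queries, with a slow value chosen only so that the mean lands near 9.8 seconds.

```python
import statistics

# Synthetic sample: 95 fast queries and 5 slow ones (assumed split).
FAST_SECS = 0.07
SLOW_SECS = 194.67  # picked to reproduce the ~9.8 s mean; not a measured value
samples = [FAST_SECS] * 95 + [SLOW_SECS] * 5

mean_secs = statistics.mean(samples)
median_secs = statistics.median(samples)
print(f"mean={mean_secs:.2f}s median={median_secs:.2f}s")
```

A handful of very slow lookups dominates the mean while the median stays at the fast value, which is why the median alone hides the problem.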
Anders was able to see that any query for a name under .by or .ua would simply get no response for 4.5 minutes (apparently some default in Unbound before it drops the query).
So he moved the Unbound server to another IP that is not a Tor IP. The situation improved, with roughly 1% to 1.5% timeouts over the last few days.
toralf
As for toralf, he saw roughly the same problem: DNS timeouts went up to 7% with the same setup, that is, with Unbound in front. One other thing is a bit weird, though: he experimented by emptying his Exit policy down to no ports, so that DNS queries would stop. Then he opened these:
ExitPolicy accept *:8074 # Gadu-Gadu
ExitPolicy accept *:11371 # OpenPGP hkp (http keyserver protocol)
ExitPolicy accept *:64738 # Mumble
About an hour later (likely once the consensus had propagated the new Exit policy), a flood of DNS requests would arrive for various domains unrelated to these ports, such as "facebook.com" or "mail.gmail.com".
Intriguing that such requests would end up on those ports and in such numbers so quickly.
Observations
- It appears that Tor IPs are being censored at various DNS levels, which was confirmed for some ccTLDs (.by and .ua above).
- Our 1% threshold is likely too low, so we should bump it, but picking the new number is complicated: if 20% of your queries go to .by in a given 10 minutes, you are flagged as overloaded.
- One solution seems to be to suggest that operators put their Unbound on an IP unrelated to Tor. This can be difficult as IPv4 addresses are getting scarce and thus expensive...
https://community.torproject.org/relay/setup/exit/
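To illustrate why a single threshold is hard to pick, the sketch below breaks the timeout rate down by TLD. The input format, a list of (domain, timed_out) pairs, is an assumption for illustration; it is not something tor or Unbound emits in this shape, and the function names are hypothetical.

```python
from collections import defaultdict


def timeout_rate_by_tld(results):
    """Return {tld: (timeouts, total, rate)} from (domain, timed_out) pairs."""
    counts = defaultdict(lambda: [0, 0])  # tld -> [timeouts, total]
    for domain, timed_out in results:
        # Take the last label, ignoring a trailing root dot and letter case.
        tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
        counts[tld][1] += 1
        if timed_out:
            counts[tld][0] += 1
    return {tld: (t, n, t / n) for tld, (t, n) in counts.items()}


def overall_timeout_rate(results):
    """Fraction of all queries that timed out."""
    results = list(results)
    if not results:
        return 0.0
    return sum(1 for _, timed_out in results if timed_out) / len(results)


# Synthetic example: 20% of queries go to .by and all of them time out.
example = [("example.by", True)] * 20 + [("example.com", False)] * 80
```

In this synthetic case the overall rate is 20% even though every non-.by query succeeds, so the relay would be reported as overloaded purely because of where its clients' queries went.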
In my opinion, we need to assess the DNS situation on our side, and likely in a systematic way. In other words, I think we have to run Exit(s) here and conduct experiments and measurements along with Unbound.
We should also likely have scanners in place that query various domains and ccTLDs in order to learn the state of DNS censorship for Tor users.
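As a rough idea of what such a scanner could look like, here is a minimal sketch. It is only a starting point under stated assumptions: the function names are hypothetical, the default resolver is whatever the host is configured to use (for example a local Unbound), and a real scanner would have to send its queries from a Tor exit IP to observe the behaviour described above.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def system_resolve(name):
    """Resolve `name` through the host's configured resolver."""
    return socket.gethostbyname(name)


def probe(names, resolve=system_resolve, timeout_secs=5.0):
    """Resolve each name and return a list of (name, status, elapsed_secs).

    status is "ok", "timeout", or "error".
    """
    results = []
    # One worker per name, so a stuck lookup cannot block the next one.
    pool = ThreadPoolExecutor(max_workers=max(1, len(names)))
    try:
        for name in names:
            start = time.monotonic()
            future = pool.submit(resolve, name)
            try:
                future.result(timeout=timeout_secs)
                status = "ok"
            except FutureTimeout:
                status = "timeout"
            except OSError:
                # Covers socket.gaierror and other resolver failures.
                status = "error"
            results.append((name, status, time.monotonic() - start))
    finally:
        # Do not wait here for lookups still stuck in the resolver; their
        # threads finish on their own once the resolver gives up.
        pool.shutdown(wait=False)
    return results
```

Passing the resolver in as a parameter keeps the probe logic testable without network access and makes it easy to swap in a resolver that talks to a specific Unbound instance.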
Finally, the "overload" DNS timeout threshold should likely be raised but to what value is still unclear to me.