Include reverse DNS lookup results in details
We should run reverse DNS lookups and include their results in details documents. What's the best way to run these lookups in Java? Also, do we have to run them every hour for every relay?
I wrote a simple Java application that looks up host names using the following line of code:
InetAddress.getByName(address).getHostName()
The application also measures how long each lookup took. I ran it for the first 1000 relays in the consensus published on 2012-02-18 at 03:00:00. Here are some simple statistics on the lookup times, in milliseconds:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  114.0   688.8  1032.0  1906.0  1628.0 81120.0
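For reference, here is a minimal sketch of what such a timing application might look like. It is not the original code: the class name LookupTimer, its helper methods, and the choice of linear-interpolation quartiles are assumptions made for illustration. Only the InetAddress call is taken from the line above.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;

class LookupTimer {

  /** Reverse-resolves one address with the same call as shown above. */
  static String reverseLookup(String address) {
    try {
      return InetAddress.getByName(address).getHostName();
    } catch (UnknownHostException e) {
      return address; // input was neither an IP literal nor a resolvable name
    }
  }

  /** Returns the wall-clock duration of one lookup in milliseconds. */
  static double timeLookupMillis(String address) {
    long start = System.nanoTime();
    reverseLookup(address);
    return (System.nanoTime() - start) / 1e6;
  }

  /** Linear-interpolation quantile of an array sorted in ascending order. */
  static double quantile(double[] sorted, double p) {
    double h = (sorted.length - 1) * p;
    int lo = (int) Math.floor(h);
    int hi = (int) Math.ceil(h);
    return sorted[lo] + (h - lo) * (sorted[hi] - sorted[lo]);
  }

  /** Returns {min, 1st quartile, median, mean, 3rd quartile, max}. */
  static double[] summarize(double[] values) {
    double[] s = values.clone();
    Arrays.sort(s);
    double mean = Arrays.stream(s).average().orElse(Double.NaN);
    return new double[] {
      s[0], quantile(s, 0.25), quantile(s, 0.5), mean,
      quantile(s, 0.75), s[s.length - 1]
    };
  }

  public static void main(String[] args) {
    // Relay IP addresses are expected as command-line arguments.
    double[] millis = Arrays.stream(args)
        .mapToDouble(LookupTimer::timeLookupMillis)
        .toArray();
    if (millis.length > 0) {
      System.out.println(Arrays.toString(summarize(millis)));
    }
  }
}
```

Running it with a list of relay addresses as arguments prints the six summary values in the same order as the table above.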
At the mean of about 1.9 seconds per lookup, looking up all 2759 relays in the consensus one after another would have taken about 1.5 hours (2759 × 1.906 s ≈ 5259 s). That rules out sequentially looking up reverse DNS entries for all relays in a consensus every hour. We'll need to make some optimizations before even starting. Questions are:
- Is there a faster way to look up reverse DNS entries than the one used in this simple Java application?
- Can we group multiple lookups and make a single request for them?
- How often do we need to refresh a reverse DNS lookup result? In theory we could cache results for an arbitrary time, but would they still be accurate after 3, 6, 12, 24 hours?
- How many requests can we make in parallel using Java threads? The Java side is easy and probably doesn't eat too much CPU time, but would we trigger some mechanism at our ISP when we make 100 requests at a time?
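To make the parallel-lookup question concrete, here is a rough sketch of running lookups on a fixed-size thread pool. The class name ParallelReverseLookup, the method lookupAll, and the pluggable resolver parameter are hypothetical and only serve the illustration. The pool size is a parameter precisely because the right value is still an open question.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelReverseLookup {

  /** Default resolver: the same call as in the sequential test. */
  static String resolve(String address) {
    try {
      return InetAddress.getByName(address).getHostName();
    } catch (UnknownHostException e) {
      return address; // input was neither an IP literal nor a resolvable name
    }
  }

  /**
   * Resolves all addresses using the given number of worker threads and
   * returns a map from address to host name in input order.
   */
  static Map<String, String> lookupAll(List<String> addresses, int threads,
      Function<String, String> resolver) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      // Submit one task per address; at most `threads` run at a time.
      List<Future<String>> futures = new ArrayList<>();
      for (String address : addresses) {
        Callable<String> task = () -> resolver.apply(address);
        futures.add(pool.submit(task));
      }
      // Collect results in the order the addresses were given.
      Map<String, String> results = new LinkedHashMap<>();
      for (int i = 0; i < addresses.size(); i++) {
        try {
          results.put(addresses.get(i), futures.get(i).get());
        } catch (InterruptedException | ExecutionException e) {
          throw new IllegalStateException("Lookup failed", e);
        }
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    int threads = 5; // pool size to experiment with
    System.out.println(
        lookupAll(Arrays.asList(args), threads, ParallelReverseLookup::resolve));
  }
}
```

Passing the resolver as a function makes it possible to try other lookup mechanisms, or a stub for testing, without touching the pool logic.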
Here are some comments after talking to George and Damian:
- An average lookup time of 1.9 seconds per request is plausible.
- Using a thread pool with 5 lookup threads should be a fine start.
- Caching results for 12 hours should work fine. It's much more likely that a relay's IP address changes than that its host name changes. We could also keep some simple statistics on how often host names actually change when looking them up; if the fraction is higher than we'd like it to be, we can still reduce the caching period to 6 hours or less. We should document in protocol.html how often host names are looked up.
- Performing multiple lookups per request would be cool, but is probably not supported by Java libraries.
- I re-ran the analysis above, but this time with the host command-line tool instead of Java. Lookup times are much lower (in seconds below, compared to milliseconds above), so there must be something going on in Java which slows down the lookup. More research needed.

    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.0320  0.1800  0.3780  0.4252  0.5420 12.0300
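The caching idea from the comments above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name HostNameCache and its methods are hypothetical, the maximum age is a constructor parameter (12 hours in the discussion above), and the actual lookup is left to the caller so that it can run on a thread pool without holding the cache lock.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

class HostNameCache {

  /** One cached lookup result together with the time it was obtained. */
  private static final class Entry {
    final String hostName;
    final Instant lookedUp;

    Entry(String hostName, Instant lookedUp) {
      this.hostName = hostName;
      this.lookedUp = lookedUp;
    }
  }

  private final Map<String, Entry> entries = new HashMap<>();
  private final Duration maxAge;
  private int refreshes = 0; // repeated lookups of an already known address
  private int changes = 0;   // refreshes that returned a different host name

  HostNameCache(Duration maxAge) {
    this.maxAge = maxAge;
  }

  /** Returns true if the address is unknown or its entry has reached maxAge. */
  synchronized boolean needsLookup(String address, Instant now) {
    Entry entry = entries.get(address);
    return entry == null
        || Duration.between(entry.lookedUp, now).compareTo(maxAge) >= 0;
  }

  /** Stores a fresh lookup result and updates the change statistics. */
  synchronized void store(String address, String hostName, Instant now) {
    Entry old = entries.put(address, new Entry(hostName, now));
    if (old != null) {
      refreshes++;
      if (!old.hostName.equals(hostName)) {
        changes++;
      }
    }
  }

  /** Returns the cached host name, or null if the address is unknown. */
  synchronized String get(String address) {
    Entry entry = entries.get(address);
    return entry == null ? null : entry.hostName;
  }

  /** Fraction of refreshes in which the host name had changed. */
  synchronized double changedFraction() {
    return refreshes == 0 ? 0.0 : (double) changes / refreshes;
  }
}
```

A caller would check needsLookup for each relay address in a new consensus, resolve only those addresses, and pass the results to store. If changedFraction turns out higher than we'd like, the cache can be created with a shorter maximum age.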
(This was issue 7 in my GitHub repository.)