Implement basic usage statistics in obfsproxy
We should implement some basic usage statistics in obfsproxy so that we can learn about usage until Tor itself supports obfsproxy statistics (#5040). Once Tor supports these statistics, the implementation in obfsproxy can be removed. Both Tor's and obfsproxy's statistics should be equivalent or at least easily comparable.
The idea is to have obfsproxy log incoming connections in a privacy-aware way and provide a simple script to convert these logs into a format that can be published without issues. Bridge operators can periodically run the script and send the output to the Tor developers, who publish and analyze it. The implementation in obfsproxy should be quite simple in order not to break too much stuff. The conversion script should be dead simple, so that bridge operators can understand what's going on.
Here's a possible approach:
We want to count daily connections by country and daily unique IP addresses by country. Similar to other statistics in Tor, we want to aggregate data over 24-hour periods, resolve IP addresses to country codes, and round up frequencies to multiples of 8.
1. When obfsproxy starts, it does three things: a) generate a secret string S that it only keeps in memory; b) note the timestamp TS when it started; c) create a buffer B with a capacity of 100 log messages.
2. Whenever obfsproxy receives a client connection, it runs steps 3 to 5:
3. It checks whether at least 24 hours have passed since TS. If so, it flushes all log messages from buffer B, shuffles them, and appends them to a file on disk. It also increments TS in 24-hour steps until TS is not more than 24 hours in the past.
4. It checks whether B is full, i.e., contains 100 messages. If so, it flushes B and appends messages to a file on disk in random order.
5. It creates a new log message containing a) timestamp TS (which is NOT the current timestamp!), b) the country code of the connecting IP as resolved by a GeoIP database, c) the hashed IP address using secret S, i.e., H(IP || S) with a cryptographic hash function of the implementor's choice. An example log message would be
   "2012-02-07 14:01:04 de 1234567890123456789012345678901234567890".
6. When obfsproxy stops, it does NOT flush the contents of B to disk. It forgets about S, possibly in a cryptographically secure manner.
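To make the logging steps above more concrete, here is a rough Python sketch of how they could fit together. It is not obfsproxy code: the class name `BufferedConnectionLog`, the `path` argument, and the `resolve_country` callback (standing in for a GeoIP lookup) are made up for illustration, and SHA-256 is just one possible choice for H. The optional `now` arguments exist only so that the behavior can be tried out with fixed timestamps.

```python
import hashlib
import os
import random
from datetime import datetime, timedelta, timezone

INTERVAL = timedelta(hours=24)   # aggregation period
BUFFER_CAPACITY = 100            # capacity of buffer B


class BufferedConnectionLog:
    """Hypothetical sketch of the buffered logging approach."""

    def __init__(self, path, resolve_country, now=None):
        self.path = path                        # log file on disk (assumed name)
        self.resolve_country = resolve_country  # assumed callable: IP string -> country code
        self.secret = os.urandom(32)            # S, kept in memory only
        self.ts = now if now is not None else datetime.now(timezone.utc)  # TS
        self.buffer = []                        # B

    def _flush(self):
        # Shuffle first so the file never reveals the original connection order.
        random.shuffle(self.buffer)
        with open(self.path, "a") as log_file:
            for message in self.buffer:
                log_file.write(message + "\n")
        self.buffer = []

    def record(self, ip, now=None):
        """Handle one incoming client connection from address `ip`."""
        now = now if now is not None else datetime.now(timezone.utc)

        # Interval check: flush B and advance TS in 24-hour steps
        # until TS is less than 24 hours in the past.
        if now - self.ts >= INTERVAL:
            self._flush()
            while now - self.ts >= INTERVAL:
                self.ts += INTERVAL

        # Capacity check: flush B if it already holds 100 messages.
        if len(self.buffer) >= BUFFER_CAPACITY:
            self._flush()

        # New message: TS (not the current time), country code, H(IP || S).
        digest = hashlib.sha256(ip.encode("ascii") + self.secret).hexdigest()
        self.buffer.append("%s %s %s" % (
            self.ts.strftime("%Y-%m-%d %H:%M:%S"),
            self.resolve_country(ip),
            digest,
        ))

    def close(self):
        """On shutdown, drop B without writing it and forget S."""
        self.buffer = []
        self.secret = None
```

Note that dropping the reference to S in `close()` does not securely wipe memory; doing that properly would need extra care in whatever language obfsproxy ends up using.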
The buffer has two functions here. First, it removes the original order of connections, which may still be meaningful if it contains connections from countries with few connections. Second, the buffer protects the timing of single client connections that occur when obfsproxy is terminated and restarted shortly after a 24-hour interval ends. The buffer size of 100 was arbitrarily chosen to avoid memory problems on heavily used bridges. Higher numbers are preferred, but if that makes things more complicated, 100 should be a large enough number.
The log messages still reveal too much information to be published. They shouldn't contain IP hashes, and frequencies still need to be rounded up to the next multiple of 8. The following bash script, which probably requires a lot more comments, converts a log message file into a format that can be published by bridge operators.
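# "data" is the log file written by obfsproxy, one "DATE TIME CC HASH" line per connection.
echo "Daily rounded total requests by country"
# Drop the IP hash, count lines per (date, time, country), and round each count up to a multiple of 8.
cut -d" " -f1-3 data | sort | uniq -c | \
awk '{printf "%s %s %s %d\n", $2, $3, $4, 8*(int(($1+7)/8))}'
echo "Daily rounded unique IPs by country"
# Deduplicate full lines first so each hashed IP counts once per interval, then count and round as above.
sort data | uniq | cut -d" " -f1-3 | uniq -c | \
awk '{printf "%s %s %s %d\n", $2, $3, $4, 8*(int(($1+7)/8))}'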
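For anyone who wants to cross-check the pipeline, the same aggregation can be written as a short Python sketch. The function names `round_up` and `aggregate` are invented here, and the sketch assumes the four-field line format from the example log message (date, time, country code, hashed IP); lines that don't match are skipped.

```python
from collections import Counter


def round_up(count, multiple=8):
    """Round count up to the next multiple of `multiple` (8 by default)."""
    return multiple * ((count + multiple - 1) // multiple)


def aggregate(lines):
    """Return (rounded total requests, rounded unique IPs) per (date, time, country)."""
    totals = Counter()
    seen = set()
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        date, time, country, ip_hash = fields
        totals[(date, time, country)] += 1
        seen.add((date, time, country, ip_hash))  # one entry per distinct hashed IP
    uniques = Counter(key[:3] for key in seen)
    return (
        {key: round_up(n) for key, n in totals.items()},
        {key: round_up(n) for key, n in uniques.items()},
    )
```

Passing an open log file to `aggregate` should yield the same numbers the shell script prints, since both count per (date, time, country) and round with the same formula.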
Note that the approach taken here was designed to keep the changes to obfsproxy small. Of course, we could implement everything in obfsproxy and write nice files that bridge operators can mail to the Tor devs directly. That would be an implementation similar to what Tor does for the various statistics. The buffered logging approach seemed to be a good compromise between not logging sensitive data and not adding too much code. Whether that is true is a question for the obfsproxy developers.