Rewrite the censorship detector used by the Tor Metrics website in Java
The censorship detector written by George Danezis in 2011 is the only part of the Tor Metrics website that is written in Python. We should consider rewriting it in Java in order to integrate it more closely into the rest of the Tor Metrics website code. This is also related to legacy/trac#19754 (moved).
iwakeh, want to comment on whether this makes sense or not, before somebody else comes and picks this up?
(The following thoughts depend on whether we reach consensus in the metrics team that this is even a good idea.)
The first step of this rewrite should be to create a minimal setup for running the Python code that doesn't require setting up your own instance of the Tor Metrics website. I'll attach a compressed version of the input file userstats-detector.csv to this ticket. Running the Python version should be as simple as downloading that attachment and the two Python files detector.py and country_info.py from metrics-web's clients module and running:
unxz userstats-detector.csv.xz
python detector.py
That command should run for a few minutes and produce a couple of files including userstats-ranges.csv, which is the only output file we care about:
date,country,minusers,maxusers
2011-09-08,a1,559.698186453,1399.64885163
2011-09-09,a1,469.497090181,1451.46081727
2011-09-11,a1,639.857484235,1457.19233381
2011-09-12,a1,597.260782974,1312.46735446
[...]
Step two could be to throw out any unused code that is not required to produce this output file. Ideally, this would happen in one or more separate commits.
Step three would be to look at which external dependencies are required to rewrite the remaining code in Java. I haven't looked at this at all yet, so maybe this is doable without adding external dependencies, which would be best. But if external dependencies are necessary, maybe there's something in Apache Commons that we can use here. In any case, adding external dependencies requires discussion on this ticket.
Step four would be to do the rewrite and verify that it produces roughly the same results (we're cutting off decimal places, for example, so the numbers won't match exactly). There's a guide on coding style here.
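To make "roughly the same results" checkable, something like the following could compare the userstats-ranges.csv written by the Python version with the one written by the Java rewrite. This is only a sketch: the column names are taken from the sample output above, while the function names, the two file paths passed on the command line, and the relative tolerance of 1e-6 are assumptions that would need to be agreed on in this ticket.

```python
import csv
import math
import sys


def read_ranges(lines):
    """Map (date, country) -> (minusers, maxusers).

    `lines` is any iterable of CSV lines (an open file works), starting with
    the header date,country,minusers,maxusers.
    """
    ranges = {}
    for row in csv.DictReader(lines):
        key = (row["date"], row["country"])
        ranges[key] = (float(row["minusers"]), float(row["maxusers"]))
    return ranges


def compare_ranges(expected, actual, rel_tol=1e-6):
    """Return a list of human-readable differences between two range maps.

    rel_tol is an assumed tolerance, not a value from the ticket.
    """
    problems = []
    for key in sorted(expected.keys() | actual.keys()):
        if key not in actual:
            problems.append(f"{key}: missing from new output")
        elif key not in expected:
            problems.append(f"{key}: only in new output")
        else:
            names = ("minusers", "maxusers")
            for name, old, new in zip(names, expected[key], actual[key]):
                if not math.isclose(old, new, rel_tol=rel_tol):
                    problems.append(f"{key}: {name} differs ({old} vs {new})")
    return problems


if __name__ == "__main__":
    if len(sys.argv) == 3:
        # Hypothetical usage: first the Python output, then the Java output.
        with open(sys.argv[1], newline="") as old_file:
            old_ranges = read_ranges(old_file)
        with open(sys.argv[2], newline="") as new_file:
            new_ranges = read_ranges(new_file)
        differences = compare_ranges(old_ranges, new_ranges)
        for line in differences:
            print(line)
        sys.exit(1 if differences else 0)
    else:
        print("usage: compare_ranges.py OLD.csv NEW.csv")
```

Rows are keyed by date and country, so the check doesn't depend on both implementations writing rows in the same order. A relative tolerance is used rather than exact string comparison because the Java version is expected to write fewer decimal places.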
Step five would be to review the new code and integrate it into metrics-web.
All in all, I could imagine that steps 1 to 4 might be an interesting task for a new volunteer. Optimistically adding the
But let's first discuss whether this rewrite makes sense, or whether there's a better plan to do it!