Rewrite CollecTor relaydescs module using Stem/txtorcon
The CollecTor service collects and archives data from various nodes and services in the public Tor network. Internally, it consists of several modules that are running in the background following a pre-defined schedule. These modules either download data from other hosts or process data that has been copied from other hosts to the local file system. The processed data is then provided via a locally running static web server.
CollecTor is written in Java. It uses several APIs provided either in the JDK or by third-party libraries. For example, it uses java.util.concurrent for scheduling. However, it does not use a specific framework for batch processing. That is why it has to solve challenges like the following on its own:
- Scheduling: Make sure modules are running, say, once per hour; avoid overlapping runs.
- Dependencies: Make sure that module runs don't interfere with each other; one module writes newly obtained files to disk, another tars them up, yet another writes an index file and provides that to external applications.
- Shutdowns: Handle externally triggered shutdowns gracefully and make sure the service resumes operation after reboot, without missing data.
These are just a few examples, and CollecTor does not handle all of them as well as it could. It also seems likely that somebody has solved these challenges before. We should find out, and the best way is probably to try it out in practice.
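To make the scheduling and shutdown challenges listed above more concrete, here is a minimal sketch in Python using only the standard library. It is not how CollecTor works today and not a proposal for the final design; the class name PeriodicModule and its attributes are made up for illustration. It runs a module at a fixed interval, refuses to start a run while the previous one is still in progress, and lets a running module finish before stopping. Resuming after a reboot without missing data is not covered.

```python
import threading
import time


class PeriodicModule:
    """Hypothetical helper: run a callable periodically without overlap."""

    def __init__(self, name, run_once, interval_seconds):
        self.name = name
        self.run_once = run_once
        self.interval_seconds = interval_seconds
        self._stop = threading.Event()    # set once a shutdown is requested
        self._running = threading.Lock()  # held while a run is in progress
        self.completed_runs = 0
        self.skipped_runs = 0

    def trigger(self):
        """Start one run, or skip it if the previous run has not finished."""
        if not self._running.acquire(blocking=False):
            self.skipped_runs += 1
            return False
        try:
            self.run_once()
            self.completed_runs += 1
            return True
        finally:
            self._running.release()

    def loop(self):
        """Trigger runs at the configured interval until stop() is called."""
        while not self._stop.is_set():
            started = time.monotonic()
            self.trigger()
            remaining = self.interval_seconds - (time.monotonic() - started)
            # wait() returns early on stop(), so shutdown is not delayed
            # by a full interval; a run in progress is never interrupted.
            self._stop.wait(max(0.0, remaining))

    def stop(self):
        """Request a graceful shutdown after the current run, if any."""
        self._stop.set()
```

A service would typically run loop() in a thread per module and call stop() from a SIGTERM handler. A batch processing framework would be expected to provide these pieces, plus dependencies between modules, out of the box.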
In Mexico City we decided to evaluate existing batch processing frameworks by rewriting the CollecTor relaydescs module in Python with Stem or txtorcon. It should be sufficient to make it work for at least consensuses and server descriptors as an initial proof of concept. Other descriptor types can follow later, if we decide to switch from Java to Python for CollecTor.
The first steps are to write down requirements and to list candidate Python libraries for the batch-processing parts.
We're done with this task when we have a working prototype of CollecTor in Python that fetches consensuses and server descriptors from the directory authorities.
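As a starting point for the prototype, the download part could look roughly like the following sketch, assuming Stem is installed and using its stem.descriptor.remote module. The function names and the output layout are hypothetical. It deliberately simplifies what CollecTor does: it writes all server descriptors into a single file and does not add @type annotations. Without mirrors enabled, Stem's downloader is expected to query the directory authorities.

```python
import os


def consensus_file_name(valid_after):
    """Name a consensus file after its valid-after time (a datetime)."""
    return valid_after.strftime("%Y-%m-%d-%H-%M-%S-consensus")


def fetch_relay_descriptors(out_dir):
    """Download the current consensus and server descriptors to out_dir."""
    # Imported here so that the helper above is usable without Stem.
    from stem.descriptor import DocumentHandler
    from stem.descriptor.remote import DescriptorDownloader

    downloader = DescriptorDownloader(timeout=60)
    os.makedirs(out_dir, exist_ok=True)

    # DocumentHandler.DOCUMENT returns the consensus as a single document
    # rather than as a list of router status entries.
    consensus = downloader.get_consensus(
        document_handler=DocumentHandler.DOCUMENT).run()[0]
    consensus_path = os.path.join(
        out_dir, consensus_file_name(consensus.valid_after))
    with open(consensus_path, "wb") as out:
        out.write(consensus.get_bytes())

    # Simplification: all server descriptors go into one file.
    with open(os.path.join(out_dir, "server-descriptors"), "wb") as out:
        for descriptor in downloader.get_server_descriptors().run():
            out.write(descriptor.get_bytes())

    return consensus_path
```

Calling fetch_relay_descriptors("out") once per hour from whichever scheduler or framework we end up evaluating would cover the two descriptor types named above.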