Develop a Java/Python API that wraps relay descriptor sources and provides unified access to them
Quite a few metrics tools process archived and current relay descriptors to provide aggregate statistics, make descriptor archives searchable, or monitor the Tor network. These tools share a non-trivial amount of code that imports relay descriptors from various sources. Copying code is bad. Let's write an API that all these metrics tools can use and that facilitates developing new tools.
Note that this API is different from existing Tor controller APIs, which connect to a Tor process's control port and provide the descriptors that the Tor process knows about. The new API doesn't need to connect to a Tor control port, even though doing so would be possible. It may, however, read the cached descriptors from a Tor data directory, along with importing relay descriptors from other sources. Of course, the two APIs can be combined, but there's also a reason for the API described here to exist separately: none of the metrics tools needs to control a Tor process.
There are two major sources for relay descriptors:
Local directories: We can read relay descriptors from the cached-* files of a local Tor data directory or from the output directory of the directory-archive script or metrics-db. Some of these local directories can grow quite large, so we'll need an efficient way to skip descriptors that we have already processed. Also, some files in these directories contain multiple relay descriptors while others contain only one. We'll want to support an arbitrary number of local directories in the new API.
Directory authorities/mirrors: We can download relay descriptors from the directory authorities or directory mirrors via Tor's directory protocol. We should keep downloads to a minimum and only download missing descriptors. We should also download compressed descriptors if possible. In some cases we're interested in whether a directory authority serves a descriptor at all (e.g., in the consensus-health script). In most cases we want to set a timeout for downloading descriptors.
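To make the two sources above more concrete, here is a minimal Python sketch of what reading from them might look like. All function names (split_descriptors, read_local_descriptors, decode_body, download_descriptor) are hypothetical and not part of any existing tool. The sketch assumes that every descriptor in a file starts with a "router " line and ignores annotation lines, that a file's modification time is a good enough signal for "already processed", and that a directory server returns a zlib-compressed body when ".z" is appended to the resource path.

```python
import os
import urllib.request
import zlib


def split_descriptors(text, first_keyword="router "):
    """Split file contents into single descriptors.

    Assumption: each descriptor starts with a line beginning with
    first_keyword; annotation lines are not handled specially.
    """
    descriptors, current = [], []
    for line in text.splitlines(keepends=True):
        if line.startswith(first_keyword) and current:
            descriptors.append("".join(current))
            current = []
        current.append(line)
    if current:
        descriptors.append("".join(current))
    return descriptors


def read_local_descriptors(directories, already_seen):
    """Yield (path, descriptor) pairs from any number of local directories.

    already_seen maps file path to modification time. The caller owns it,
    so the API itself keeps no state between executions.
    """
    for directory in directories:
        for root, _dirs, files in os.walk(directory):
            for name in sorted(files):
                path = os.path.join(root, name)
                mtime = os.path.getmtime(path)
                if already_seen.get(path) == mtime:
                    continue  # unchanged since the last run, skip it
                with open(path, encoding="utf-8", errors="replace") as f:
                    text = f.read()
                already_seen[path] = mtime
                for descriptor in split_descriptors(text):
                    yield path, descriptor


def decode_body(raw, compressed):
    """Turn a downloaded body into text, inflating it if it is compressed."""
    data = zlib.decompress(raw) if compressed else raw
    return data.decode("utf-8", errors="replace")


def download_descriptor(base_url, resource, timeout=60.0, compressed=True):
    """Fetch one resource from an authority or mirror, with a timeout.

    Assumption: appending ".z" requests the zlib-compressed variant.
    """
    url = base_url + resource + (".z" if compressed else "")
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return decode_body(response.read(), compressed)
```

A tool would call read_local_descriptors once per run with the dictionary it saved from the previous run, and call download_descriptor only for descriptors it is still missing. A real implementation would more likely track descriptor digests than file modification times.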
We should design the new API so that it's stateless across executions and doesn't have its own configuration. A tool that uses the API should first initialize it by creating relay descriptor data sources and then request descriptors to process.
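One possible shape for such an API is sketched below in Python. The class and method names (DescriptorSource, StaticSource, DescriptorReader, add_source, read_new) are invented for illustration only. The sketch identifies a descriptor by the SHA-1 hash of its raw bytes, which is a stand-in and not how Tor computes descriptor digests. The set of known descriptors is passed in by the calling tool, so the API keeps no state between executions and needs no configuration of its own.

```python
import hashlib
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, List, Set


class DescriptorSource(ABC):
    """Hypothetical interface that every descriptor source implements."""

    @abstractmethod
    def descriptors(self) -> Iterable[bytes]:
        """Return the raw descriptors this source can provide."""


class StaticSource(DescriptorSource):
    """In-memory source used here in place of a directory or a download."""

    def __init__(self, items: Iterable[bytes]) -> None:
        self._items = list(items)

    def descriptors(self) -> Iterable[bytes]:
        return iter(self._items)


class DescriptorReader:
    """Unified access to any number of descriptor sources."""

    def __init__(self) -> None:
        self._sources: List[DescriptorSource] = []

    def add_source(self, source: DescriptorSource) -> None:
        self._sources.append(source)

    def read_new(self, known: Set[str]) -> Iterator[bytes]:
        """Yield descriptors whose key is not in known, adding each key.

        The caller decides whether and how to persist known between runs.
        """
        for source in self._sources:
            for raw in source.descriptors():
                key = hashlib.sha1(raw).hexdigest()  # stand-in identifier
                if key in known:
                    continue
                known.add(key)
                yield raw


# Usage: initialize sources first, then request descriptors to process.
reader = DescriptorReader()
reader.add_source(StaticSource([b"router a\n", b"router b\n"]))
reader.add_source(StaticSource([b"router b\n", b"router c\n"]))
known: Set[str] = set()
new_descriptors = list(reader.read_new(known))
```

Because the caller supplies the known set, the same reader works for a tool that wants every descriptor on each run (pass an empty set) and for one that only wants descriptors it has not yet processed (pass the set saved from the last run).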
The following tools may use the new API once it's ready: metrics-db, the part of metrics-web that aggregates statistics, the ExoneraTor database, the relay search database, the consensus-health script, the descriptor-health script, and the basic monitoring infrastructure.