Design collector to retrieve Tor network documents efficiently
Collector is a Java application that runs as a service on colchicifolium and collector-02. The service is configured via a config file in which the administrator can choose which parts of collector run and how often.
Collector has an internal scheduler that runs the selected tasks at the chosen intervals.
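As a rough illustration of "selected tasks at chosen intervals", the sketch below models one config entry per task and picks the tasks due at a given tick. The type `TaskConfig`, the function `due_tasks`, the field names, and the rule that tasks fire on multiples of their interval are all assumptions made for this example; they are not taken from the existing Java scheduler.

```rust
/// One task as it might appear in the config file.
/// All field names here are hypothetical.
#[derive(Debug, Clone)]
pub struct TaskConfig {
    pub name: &'static str,
    /// Whether the administrator switched this task on.
    pub enabled: bool,
    /// How often the task should run, in seconds.
    pub interval_secs: u64,
}

/// Returns the names of the enabled tasks that are due `elapsed_secs`
/// seconds after service start, assuming each task fires whenever the
/// elapsed time is a multiple of its interval.
pub fn due_tasks(tasks: &[TaskConfig], elapsed_secs: u64) -> Vec<&'static str> {
    tasks
        .iter()
        // The `interval_secs > 0` check also guards the modulo below.
        .filter(|t| t.enabled && t.interval_secs > 0 && elapsed_secs % t.interval_secs == 0)
        .map(|t| t.name)
        .collect()
}
```

A real scheduler would more likely keep a next-run timestamp per task; the modulo rule only keeps the example short.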
The main function of collector is fetching Tor network documents from the directory authorities and archiving them on collector.torproject.org. There, recent documents are served without compression, and data older than 3 days is archived into monthly tarballs.
Data is fetched from the directory authorities; data from bridgedb and the bridge authority is also synced locally on colchicifolium via rdsys.
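The three-day split between recent documents and archived data can be sketched as a small classification function. The names `Placement`, `placement`, and `RECENT_WINDOW` are made up for this example, and treating a document with a future timestamp as recent is an assumption, not documented behaviour.

```rust
use std::time::{Duration, SystemTime};

/// Documents younger than this stay in the uncompressed "recent" area.
pub const RECENT_WINDOW: Duration = Duration::from_secs(3 * 24 * 60 * 60);

/// Where a document belongs, based on its age.
#[derive(Debug, PartialEq, Eq)]
pub enum Placement {
    Recent,
    Archive,
}

/// Classifies a document by comparing its publication time with `now`.
pub fn placement(published: SystemTime, now: SystemTime) -> Placement {
    match now.duration_since(published) {
        // Strictly older than the window: belongs in the archive.
        Ok(age) if age > RECENT_WINDOW => Placement::Archive,
        // Within the window, or `published` lies in the future
        // (duration_since returned Err): treated as recent here.
        _ => Placement::Recent,
    }
}
```

Passing `now` in explicitly, rather than calling `SystemTime::now()` inside the function, keeps the rule easy to test.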
Collector.rs should fetch data and archive documents in object storage under a /YYYY/MM/DD/ path structure. We might not want to create tarballs anymore, since MinIO can compress stored data automatically and transparently to the app service.
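A minimal sketch of how such a date-based object key could be built is shown below. The `Date` struct, the `object_key` function, and the idea of a per-document-type prefix are assumptions for illustration. The key is emitted without a leading slash, as is usual for object storage keys.

```rust
/// Calendar date (UTC) taken from a document's publication time.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Date {
    pub year: u16,
    pub month: u8,
    pub day: u8,
}

/// Builds an object key of the form `<prefix>/YYYY/MM/DD/<file_name>`.
/// The prefix (for example a document type) is a hypothetical addition;
/// pass an empty string to get `YYYY/MM/DD/<file_name>` only.
pub fn object_key(prefix: &str, date: Date, file_name: &str) -> String {
    // Zero-pad so that keys sort chronologically as plain strings.
    let dated = format!(
        "{:04}/{:02}/{:02}/{}",
        date.year, date.month, date.day, file_name
    );
    let prefix = prefix.trim_matches('/');
    if prefix.is_empty() {
        dated
    } else {
        format!("{}/{}", prefix, dated)
    }
}
```

With zero-padded components, listing objects by a key prefix such as `2024/03/` returns one month of documents, which is roughly what the monthly tarballs provide today.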
We also do not plan to create an indexer for the files, as we might leverage the MinIO API for this. For example, the MinIO client already supports:
We might not want to store data in a tabular format at this point. It is fine to do that in the parser for now.