O.1.3. Review and update how we store and serve historical data
We currently parse network data and store it as tarballs on collector.torproject.org. In parallel, we ingest recent data into metricsdb for analysis and API access.
To ensure our data infrastructure remains reliable, maintainable, and accessible over time, we should make a concrete plan around the following areas:
- Historical Data Storage and Archiving
Define how we organize and retain historical data so that storage efficiency, ease of access, and long-term sustainability stay in balance. This covers file format, directory structure, retention policy, and how readily the data can be reprocessed in the future.
- Recurrent Ingestion and Archiving
Establish a reliable strategy for continuously ingesting new data into metricsdb while maintaining an archive that mirrors what is ingested. This includes automation, failure recovery, and versioning considerations.
- Archiving Time Series Data
Determine how we want to store and maintain long-term time series data derived from descriptors, such as relay statuses, bridge activity, and network-level summaries. We should evaluate whether the current format and database schema meet performance and analysis needs, and whether a dedicated time-series database or format would be more appropriate.
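
As a starting point for discussion, the requirement that the archive mirror what is ingested could be sketched in Python as follows. This is only an illustration under assumptions: the `<type>/<YYYY>/<MM>/<sha256>` directory layout, the JSON-lines manifest, and the function names are all hypothetical and do not describe the current CollecTor or metricsdb setup. The sketch stores each descriptor under its SHA-256 digest, writes files atomically, and records every archived item in a manifest so that a rerun after a failure skips what is already recorded.

```python
import hashlib
import json
from datetime import datetime
from pathlib import Path


def archive_dir(root: Path, descriptor_type: str, published: datetime) -> Path:
    """Hypothetical layout: <root>/<type>/<YYYY>/<MM>/ (an assumption, not the real one)."""
    return root / descriptor_type / f"{published:%Y}" / f"{published:%m}"


def load_manifest(manifest: Path) -> set[str]:
    """Return the digests already recorded in the JSON-lines manifest."""
    if not manifest.exists():
        return set()
    with manifest.open(encoding="utf-8") as fh:
        return {json.loads(line)["sha256"] for line in fh if line.strip()}


def archive_descriptor(
    root: Path,
    manifest: Path,
    seen: set[str],
    descriptor_type: str,
    published: datetime,
    raw: bytes,
) -> bool:
    """Archive one descriptor; return False if it was already recorded."""
    digest = hashlib.sha256(raw).hexdigest()
    if digest in seen:
        return False  # already archived and recorded, so reruns are no-ops

    target = archive_dir(root, descriptor_type, published) / digest
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        tmp = target.with_suffix(".tmp")
        tmp.write_bytes(raw)
        tmp.replace(target)  # rename so readers never see a partial file

    # The manifest entry is written last: if a run dies before this point,
    # the next run finds the file on disk and only adds the missing entry.
    entry = {
        "sha256": digest,
        "type": descriptor_type,
        "published": published.isoformat(),
    }
    with manifest.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    seen.add(digest)
    return True
```

A caller would load the manifest once with `load_manifest`, then call `archive_descriptor` for each descriptor in the same step that ingests it into metricsdb. Addressing files by content digest is one possible choice here; it makes duplicate detection cheap, at the cost of file names that are not human-readable.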