The Metrics pipeline
At present, we are managing two distinct pipelines for our metrics service. Our legacy data collectors are still in operation and continue to power our stable metrics services.
Over the last few years, as outlined in Issue #40012, we have identified and addressed the challenges within the current metrics pipeline. We have also initiated the design of new tools for our work-in-progress pipeline (Version 2.0).
Key issues with our existing Metrics pipeline include:
- Processing the same descriptors at multiple points in the pipeline, often more than once.
- Heavy reliance on disk and network input/output operations for data processing.
These issues stem from the need to process descriptors and dispatch the resulting data to various destinations depending on how it is used, such as CSV files, databases, or small status files. This approach eventually overloads our metrics machines.
Recognizing the need to restructure our current metrics pipeline and legacy code, we understand that this process must proceed incrementally, given that our metrics team currently consists of only one person.
Our starting point is to centralize the processing of descriptors and data extraction in one location. I propose that we do this within the collector while making use of our existing Java modules, including those from onionoo, collector, and the website.
By implementing this solution, we can reduce the volume of input/output operations, as only one service will rely heavily on disk operations. Other services will query a database instead.
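As a minimal sketch of the consumer side, assuming a hypothetical server_descriptors table populated by the collector (the actual schema lives in the metrics-sql-tables repository), a downstream service would issue a single SQL query instead of re-reading descriptor archives from disk:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class BandwidthReport {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection string, table, and columns, for illustration only.
    String url = "jdbc:postgresql://localhost:5432/metrics";
    try (Connection conn = DriverManager.getConnection(url, "metrics", "secret");
         PreparedStatement stmt = conn.prepareStatement(
             "SELECT nickname, observed_bandwidth FROM server_descriptors "
             + "WHERE published >= now() - interval '24 hours'");
         ResultSet rs = stmt.executeQuery()) {
      while (rs.next()) {
        // The service consumes already-parsed fields instead of processing
        // raw descriptors a second time.
        System.out.printf("%s %d%n", rs.getString("nickname"),
            rs.getLong("observed_bandwidth"));
      }
    }
  }
}
```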
In parallel, we have also addressed the ongoing database issues on meronense, where some tables take an excessive amount of time to query and thus require optimization.
In the future, we can explore integrating more efficient tools to reduce code maintenance. The industry currently utilizes batch and stream processing tools that could potentially replace some of our legacy Java code.
Architecture
The collector remains our primary service for collecting and archiving Tor network documents. All collected data is then processed and stored in two separate services:
- A PostgreSQL database for document archiving.
- A VictoriaMetrics instance for time series data.
descriptorParser is the Java application responsible for parsing all Tor network documents and storing the results in these two data stores.
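The sketch below is only illustrative of that parse-then-store step: the table, columns, and metric name are hypothetical, and it assumes the VictoriaMetrics instance accepts samples in Prometheus text format on its import endpoint:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class StoreDescriptor {
  public static void main(String[] args) throws Exception {
    // Hypothetical, already-parsed fields of a single server descriptor.
    String fingerprint = "ABCD1234";
    long observedBandwidth = 1048576L;
    byte[] rawDocument = new byte[0]; // raw bytes of the archived document

    // 1. Archive the document in PostgreSQL (table and columns are hypothetical).
    String url = "jdbc:postgresql://localhost:5432/metrics";
    try (Connection conn = DriverManager.getConnection(url, "metrics", "secret");
         PreparedStatement stmt = conn.prepareStatement(
             "INSERT INTO server_descriptors (fingerprint, raw_document) VALUES (?, ?)")) {
      stmt.setString(1, fingerprint);
      stmt.setBytes(2, rawDocument);
      stmt.executeUpdate();
    }

    // 2. Push one derived time-series sample to VictoriaMetrics, assuming the
    //    Prometheus-format import endpoint is available on the instance.
    String sample = String.format(
        "relay_observed_bandwidth_bytes{fingerprint=\"%s\"} %d%n",
        fingerprint, observedBandwidth);
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:8428/api/v1/import/prometheus"))
        .POST(HttpRequest.BodyPublishers.ofString(sample))
        .build();
    HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.discarding());
  }
}
```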
We have also developed the NetworkStatusAPI to enable querying the database and to make time series data accessible through a RESTful web service, implemented in Rust using actix.
Over time, our goal is for this API to serve as the backbone for all our metrics services.
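For illustration, a metrics service built on top of this API could fetch its data over HTTP as sketched below; the host and endpoint path are hypothetical placeholders for whatever routes the actix application actually exposes:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class NetworkStatusClient {
  public static void main(String[] args) throws Exception {
    // Hypothetical route; the real ones are defined by the NetworkStatusAPI.
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://example.org/networkstatus/relays?running=true"))
        .header("Accept", "application/json")
        .GET()
        .build();
    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    // A consumer would deserialize this JSON instead of parsing descriptor
    // tarballs on its own.
    System.out.println(response.body());
  }
}
```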
Ultimately, our long-term plan involves merging descriptorParser and Collector into a single service that can archive and serve Tor network documents. This data can be stored in the databases or provided as tarballs that can be independently downloaded and processed.
This effort has already begun with tor_fusion, which is currently used exclusively to parse and store onionperf analysis JSON files. The idea is for tor_fusion to use arti to retrieve and process Tor network documents.
Data tables
The current tables used in PostgreSQL are managed through the metrics-sql-tables repository.
Time series data
descriptorParser contains the list of metrics that are being maintained in VictoriaMetrics.
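As a rough sketch, a series kept in VictoriaMetrics can be read back through its Prometheus-compatible query API; the metric name below is hypothetical and stands in for one of the series defined in descriptorParser:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class QueryTimeSeries {
  public static void main(String[] args) throws Exception {
    // Hypothetical metric name used for illustration only.
    String promql = URLEncoder.encode(
        "sum(relay_observed_bandwidth_bytes)", StandardCharsets.UTF_8);
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:8428/api/v1/query?query=" + promql))
        .GET()
        .build();
    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    // The response follows the standard Prometheus query JSON format.
    System.out.println(response.body());
  }
}
```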