# tor_fusion: parsing Tor network data efficiently in Rust

tor_fusion is a project to parse [Tor](https://www.torproject.org/) network
documents in the [Rust](https://www.rust-lang.org/) programming language.

## Links:

This is the README for the tor_fusion project as a whole.
If you want to find more practical information about parsing Tor network documents
and extracting metrics, you might want to check out these links:

  * [Tor Metrics website](https://metrics.torproject.org/)

  * [Docs on reproducible metrics](https://metrics.torproject.org/reproducible-metrics.html)

  * [DescriptorParser](https://gitlab.torproject.org/tpo/network-health/metrics/descriptorParser/),
  a Java app to parse and store Tor network documents.

  * [Metrics Library](https://gitlab.torproject.org/tpo/network-health/metrics/library),
  a Java library to parse Tor network documents.

  * [Collector](https://collector.torproject.org), an archive of data from
  various nodes and services in the public Tor network.


## Why rewrite how network documents are parsed?

The data analysis community is evolving and moving to different tools than
when the metrics pipeline was first developed.

At the same time, nodes and services on the public Tor network produce far more
data than when metrics started as a project in Tor.

We are in the process of [restructuring our pipeline](https://gitlab.torproject.org/tpo/network-health/team/-/wikis/metrics/collector/pipeline)
so that it is easier to maintain over time, but also so that we are able to offer
better resources to our community and process data more efficiently.

Rust stands out as a practical choice for processing Tor network metrics because
of its performance and security features.

Rust offers efficiency when handling large datasets. Additionally, developers
can use an array of libraries explicitly designed for data analysis to
streamline data processing that is not specific to Tor, offering more
flexibility to researchers.


## What documents are supported?

We currently parse only OnionPerf analysis files. The long-term plan is to
embed [Arti](https://gitlab.torproject.org/tpo/core/arti/) to download and parse
all types of documents produced by the various network nodes and services.

## Deployment on Tor Project machines

tor_fusion is deployed on the [metricsdb-01.torproject.org](https://db.torproject.org/machines.cgi?host=metricsdb-01) machine
via Puppet and runs alongside descriptorParser on metricsdb.

When new code is merged into main, it is deployed automatically and built
directly on the machine.

The scripts used to build and run tor_fusion are also deployed via Puppet from the
[metrics-bin](https://gitlab.torproject.org/tpo/network-health/metrics/metrics-bin/-/blob/main/metricsdb/tor_fusion/)
repository.

## Run

First build tor_fusion via cargo:

```
$ cargo build --release
```

Latest tested versions are:

  * rustc 1.77.0 (aedd173a2 2024-03-17)

  * cargo 1.77.0 (3fe68eabf 2024-02-29)

You need to configure a PostgreSQL database to load the data into, via a
`config.toml` file. See the example provided in this repository:
`config.toml.example`.

The tables needed by tor_fusion are defined in the [metrics-sql-tables](https://gitlab.torproject.org/tpo/network-health/metrics/metrics-sql-tables/-/blob/main/onionperf_tables.sql?ref_type=heads)
repository.


Then decompress the analysis file and run the binary against it:

```
# decompress the onionperf analysis file:
$ xz -d onionperf-analysis.json.xz
# run the binary against the json file:
$ ./target/release/tor_fusion onionperf-analysis.json 
```
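
If you script the steps above, it can help to check file names before calling the binary. The following is a minimal sketch using only the Rust standard library. It is not part of tor_fusion: the helper names `decompressed_name` and `is_json_input` are hypothetical, and the sketch only assumes that `xz -d` strips the `.xz` suffix and that the binary expects a `.json` file, as in the commands above.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: maps `foo.json.xz` to `foo.json`, the name that
/// `xz -d` leaves behind. Returns `None` if the path does not end in `.xz`.
fn decompressed_name(path: &Path) -> Option<PathBuf> {
    match path.extension().and_then(|ext| ext.to_str()) {
        // An empty extension removes the trailing `.xz`.
        Some("xz") => Some(path.with_extension("")),
        _ => None,
    }
}

/// Hypothetical check: true if the path ends in `.json`.
fn is_json_input(path: &Path) -> bool {
    path.extension().and_then(|ext| ext.to_str()) == Some("json")
}

fn main() {
    let compressed = Path::new("onionperf-analysis.json.xz");
    let input = decompressed_name(compressed).expect("expected a .xz file");
    if is_json_input(&input) {
        // Print the command to run once the file has been decompressed.
        println!("./target/release/tor_fusion {}", input.display());
    } else {
        eprintln!("{} is not a .json file", input.display());
    }
}
```

The sketch only inspects file names; decompression and the actual run are still done with `xz` and the tor_fusion binary.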