Initialize webstats parser (9fb8e636) · Commits · The Tor Project / Network Health / Metrics / tornado

.gitignore

0 → 100644

+1 −0

		/target

Cargo.lock

0 → 100644

+1820 −0

File added.


Cargo.toml

0 → 100644

+17 −0

		[package]
		name = "tornado"
		version = "0.1.0"
		edition = "2024"

		[dependencies]
		anyhow = "1"
		clickhouse = "0.12"
		regex = "1"
		serde = { version = "1", features = ["derive"] }
		tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
		xz2 = "0.1"
		reqwest = { version = "0.12", default-features = false, features = ["rustls-tls", "gzip", "brotli", "deflate", "stream"] }
		serde_json = "1"

		[dev-dependencies]
		tempfile = "3"

README.md

0 → 100644

+62 −0

		## Overview

		This service ingests Tor Project weblog archives, aggregates request
		statistics, and writes the summary into a ClickHouse table. Files are expected
		to be xz-compressed access logs whose names look like
		`www.torproject.org_web-fsn-01.torproject.org_access.log_20250101.xz`. Each run
		scans the configured directory, decompresses any previously unseen files, counts
		hits per `(path, status)` pair, and stores the totals in ClickHouse. Processed
		files are tracked in `state.json` so they are not ingested twice; the state is
		kept inside the same directory as the logs and trimmed automatically.
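
The file-name convention above can be split into the `url`, `webserver`, and `date` values that the destination table expects. The following is a minimal sketch using only the standard library, assuming the layout is always `<url>_<webserver>_access.log_<YYYYMMDD>.xz` and that host names contain no underscores. `ArchiveName` and `parse_archive_name` are hypothetical names for illustration, not the helpers in `src/weblogs.rs`.

```rust
/// Metadata encoded in an archive file name. The field names mirror the
/// `url`, `webserver`, and `date` columns of the example table schema.
#[derive(Debug, PartialEq)]
struct ArchiveName {
    url: String,
    webserver: String,
    date: String,
}

/// Hypothetical helper: split `<url>_<webserver>_access.log_<YYYYMMDD>.xz`.
/// Returns `None` when the name does not match that assumed layout.
fn parse_archive_name(name: &str) -> Option<ArchiveName> {
    // Drop the compression suffix, then separate the host part from the date.
    let stem = name.strip_suffix(".xz")?;
    let (hosts, date) = stem.split_once("_access.log_")?;
    // Assumption: host names never contain '_', so the first '_' is the separator.
    let (url, webserver) = hosts.split_once('_')?;
    if url.is_empty() || webserver.is_empty() {
        return None;
    }
    // Accept only an eight-digit date such as 20250101.
    if date.len() != 8 || !date.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    Some(ArchiveName {
        url: url.to_string(),
        webserver: webserver.to_string(),
        date: date.to_string(),
    })
}
```

Calling `parse_archive_name` on the example name from the paragraph above yields the three components, while a non-archive name such as `state.json` yields `None`, so the state file can share the directory with the logs without being mistaken for one.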

		## Configuration

		| Variable              | Default                  | Description                                      |
		| --------------------- | ------------------------ | ------------------------------------------------ |
		| `WEBSTATS_DIR`        | `./data/store/webstats`  | Directory containing `.xz` log files and state   |
		| `WEBSTATS_TABLE`      | `webstats`               | ClickHouse table that receives aggregates        |
		| `CLICKHOUSE_URL`      | `http://localhost:8123`  | Base URL for the ClickHouse HTTP interface       |
		| `CLICKHOUSE_DB`       | `default`                | Database/schema used for inserts                 |
		| `CLICKHOUSE_USER`     | unset                    | Optional username for ClickHouse authentication  |
		| `CLICKHOUSE_PASSWORD` | unset                    | Optional password for ClickHouse authentication  |

		## Running

		1. Ensure the destination ClickHouse table exists (for example, with columns
		`url String`, `webserver String`, `date String`, `path String`,
		`status UInt16`, `hits UInt64`).
		2. Populate the log directory with the `.xz` files to process. You can let the
		helper script fetch fresh archives automatically:

		```bash
		WEBSTATS_DIR=/path/to/webstats ./bin/download_webstats.sh
		```

		Existing state is read from `<WEBSTATS_DIR>/state.json`; if it is missing, a
		fresh one is created automatically.
		3. Run the service:

		```bash
		WEBSTATS_DIR=/path/to/webstats \
		WEBSTATS_TABLE=webstats \
		CLICKHOUSE_URL=http://localhost:8123 \
		cargo run --release
		```

		During execution the service prints progress as it discovers files, skips any
		that fail to parse, and finally reports how many aggregated rows were inserted
		into ClickHouse.
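
The "skip files that were already ingested" step can be pictured as a set difference between the directory listing and the recorded state. The sketch below is illustrative only: it assumes the state is a plain set of file names, and `unseen_archives` is a hypothetical name, not the API of `src/state.rs`. Reading and writing `state.json` and the automatic trimming are left out because this page does not describe their details.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: return the `.xz` files from `listing` that are not
/// yet recorded in `processed`, sorted and without duplicates.
fn unseen_archives(listing: &[&str], processed: &BTreeSet<String>) -> Vec<String> {
    let mut fresh = Vec::new();
    for &name in listing {
        // Ignore non-archive entries (for example the state file itself)
        // and anything that was ingested on an earlier run.
        if name.ends_with(".xz") && !processed.contains(name) {
            fresh.push(name.to_string());
        }
    }
    fresh.sort();
    fresh.dedup();
    fresh
}
```

In this sketch a caller would add each file name to `processed` only after its rows were written successfully, so a failed run leaves the file eligible for the next attempt.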

		## Development

		Most of the logic lives in `src/weblogs.rs`, which exposes helpers for parsing
		log files. `src/main.rs` wires those helpers into the ClickHouse writer and the
		state tracker defined in `src/state.rs`.
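
To make the aggregation concrete, here is a minimal standard-library sketch of counting hits per `(path, status)`. It assumes each decompressed line follows an Apache-style access-log layout in which the first double-quoted field is `METHOD PATH PROTOCOL` and the next token is the status code. That layout, and the names `parse_line` and `aggregate`, are assumptions for illustration; the real helpers in `src/weblogs.rs` may differ (the manifest pulls in `regex`, for instance).

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: extract `(path, status)` from one access-log line.
/// Returns `None` for lines that do not match the assumed layout.
fn parse_line(line: &str) -> Option<(String, u16)> {
    // The request is the first double-quoted field: "METHOD PATH PROTOCOL".
    let (_, rest) = line.split_once('"')?;
    let (request, tail) = rest.split_once('"')?;
    let path = request.split_whitespace().nth(1)?;
    // The status code is the first whitespace-separated token after the request.
    let status = tail.split_whitespace().next()?.parse::<u16>().ok()?;
    Some((path.to_string(), status))
}

/// Count hits per `(path, status)`. Returns the counts together with the
/// number of lines that could not be parsed.
fn aggregate<'a, I>(lines: I) -> (BTreeMap<(String, u16), u64>, usize)
where
    I: IntoIterator<Item = &'a str>,
{
    let mut hits = BTreeMap::new();
    let mut skipped = 0;
    for line in lines {
        match parse_line(line) {
            Some(key) => *hits.entry(key).or_insert(0) += 1,
            None => skipped += 1,
        }
    }
    (hits, skipped)
}
```

Each entry of the resulting map corresponds to one row of the example schema (`path`, `status`, `hits`). A `BTreeMap` is used here only so that the output order is deterministic, which makes such a helper easier to test.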

		Useful commands:

		```bash
		cargo fmt
		cargo clippy
		cargo test
		```

bin/download_webstats.sh

0 → 100755

+24 −0

		#!/usr/bin/env bash
		set -euo pipefail

		BASE_URL="https://collector.torproject.org/recent/webstats"
		DEST_DIR=${WEBSTATS_DIR:-"./data/store/webstats"}

		mkdir -p "$DEST_DIR"
		TMP_DIR=$(mktemp -d)
		trap 'rm -rf "$TMP_DIR"' EXIT

		curl -fsSL "$BASE_URL/" > "$TMP_DIR/index.html"

		grep -o 'href="[^"]*\.xz"' "$TMP_DIR/index.html" \
		  | sed -E 's/href="([^"]+)"/\1/' \
		  | while read -r fname; do
		      if [ -e "$DEST_DIR/$fname" ]; then
		        continue
		      fi
		      echo "Downloading $fname"
		      curl -fSL "$BASE_URL/$fname" -o "$TMP_DIR/$fname"
		      mv "$TMP_DIR/$fname" "$DEST_DIR/$fname"
		    done

		echo "Done downloading new archives into $DEST_DIR"