Commit 9fb8e636 authored by Hiro 🏄

Initialize webstats parser

.gitignore

0 → 100644
+1 −0
/target

Cargo.lock

0 → 100644
+0 −0

File added.

Preview size limit exceeded, changes collapsed.

Cargo.toml

0 → 100644
+17 −0
[package]
name = "tornado"
version = "0.1.0"
edition = "2024"

[dependencies]
anyhow = "1"
clickhouse = "0.12"
regex = "1"
serde = { version = "1", features = ["derive"] }
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
xz2 = "0.1"
reqwest = { version = "0.12", default-features = false, features = ["rustls-tls", "gzip", "brotli", "deflate", "stream"] }
serde_json = "1"

[dev-dependencies]
tempfile = "3"

README.md

0 → 100644
+62 −0
## Overview

This service ingests Tor Project weblog archives, aggregates request
statistics, and writes the summary into a ClickHouse table.  Files are expected
to be xz-compressed access logs whose names look like
`www.torproject.org_web-fsn-01.torproject.org_access.log_20250101.xz`.  Each run
scans the configured directory, decompresses any previously unseen files, counts
hits per unique `(path, status)` pair, and stores the totals in ClickHouse.  Processed
files are tracked in `state.json` so they are not ingested twice; the state is
kept inside the same directory as the logs and trimmed automatically.
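
The file name carries metadata that plausibly feeds the `url`, `webserver`, and
`date` columns.  The sketch below is not the code from this commit; it assumes
the layout `<url>_<webserver>_access.log_<YYYYMMDD>.xz` inferred from the
example name above, and the `LogFileName` struct and `parse_log_file_name`
function are hypothetical names chosen for illustration.

```rust
/// Metadata recovered from an archive file name.
/// Hypothetical type; the field names mirror the suggested table columns.
#[derive(Debug, PartialEq)]
struct LogFileName {
    url: String,
    webserver: String,
    date: String,
}

/// Splits `<url>_<webserver>_access.log_<YYYYMMDD>.xz` into its parts.
/// Returns `None` when the name does not follow that assumed layout.
fn parse_log_file_name(name: &str) -> Option<LogFileName> {
    let stem = name.strip_suffix(".xz")?;
    let (hosts, date) = stem.split_once("_access.log_")?;
    // Host names cannot contain `_`, so the first one separates the two hosts.
    let (url, webserver) = hosts.split_once('_')?;
    if url.is_empty() || webserver.is_empty() {
        return None;
    }
    // Require an eight-digit date such as 20250101.
    if date.len() != 8 || !date.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    Some(LogFileName {
        url: url.to_string(),
        webserver: webserver.to_string(),
        date: date.to_string(),
    })
}

fn main() {
    let name = "www.torproject.org_web-fsn-01.torproject.org_access.log_20250101.xz";
    println!("{:?}", parse_log_file_name(name));
}
```

Returning `None` for anything that does not match lets a caller skip stray
files such as `state.json` that live in the same directory.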

## Configuration

| Variable              | Default                 | Description                                      |
| --------------------- | ----------------------- | ------------------------------------------------ |
| `WEBSTATS_DIR`        | `./data/store/webstats` | Directory containing `.xz` log files and state   |
| `WEBSTATS_TABLE`      | `webstats`              | ClickHouse table that receives aggregates        |
| `CLICKHOUSE_URL`      | `http://localhost:8123` | Base URL for the ClickHouse HTTP interface       |
| `CLICKHOUSE_DB`       | `default`               | Database/schema used for inserts                 |
| `CLICKHOUSE_USER`     | unset                   | Optional username for ClickHouse authentication  |
| `CLICKHOUSE_PASSWORD` | unset                   | Optional password for ClickHouse authentication  |
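
One way to resolve these variables with their defaults is sketched below.  It
uses only the standard library; the `Config` struct and its methods are
hypothetical and not taken from this commit.  The lookup is passed in as a
closure so the defaults can be exercised without touching the real process
environment.

```rust
use std::env;

/// Settings named after the variables in the table above (hypothetical type).
struct Config {
    dir: String,
    table: String,
    clickhouse_url: String,
    clickhouse_db: String,
    clickhouse_user: Option<String>,
    clickhouse_password: Option<String>,
}

impl Config {
    /// Builds a config from any key lookup, applying the documented defaults.
    fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Self {
        let or_default =
            |key: &str, default: &str| lookup(key).unwrap_or_else(|| default.to_string());
        Config {
            dir: or_default("WEBSTATS_DIR", "./data/store/webstats"),
            table: or_default("WEBSTATS_TABLE", "webstats"),
            clickhouse_url: or_default("CLICKHOUSE_URL", "http://localhost:8123"),
            clickhouse_db: or_default("CLICKHOUSE_DB", "default"),
            // Credentials stay unset unless the variables are present.
            clickhouse_user: lookup("CLICKHOUSE_USER"),
            clickhouse_password: lookup("CLICKHOUSE_PASSWORD"),
        }
    }

    /// Reads the process environment.
    fn from_env() -> Self {
        Self::from_lookup(|key| env::var(key).ok())
    }
}

fn main() {
    let cfg = Config::from_env();
    println!(
        "reading logs from {} into {}.{} at {}",
        cfg.dir, cfg.clickhouse_db, cfg.table, cfg.clickhouse_url
    );
    // Report only whether credentials are present, never their values.
    println!(
        "credentials set: user={} password={}",
        cfg.clickhouse_user.is_some(),
        cfg.clickhouse_password.is_some()
    );
}
```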

## Running

1. Ensure the destination ClickHouse table exists (for example, with columns
   `url String`, `webserver String`, `date String`, `path String`,
   `status UInt16`, `hits UInt64`).
2. Populate the log directory with the `.xz` files to process. You can let the
   helper script fetch fresh archives automatically:

   ```bash
   WEBSTATS_DIR=/path/to/webstats ./bin/download_webstats.sh
   ```

   Existing state is read from `<WEBSTATS_DIR>/state.json`; if it is missing, a
   fresh one is created automatically.
3. Run the service:

   ```bash
   WEBSTATS_DIR=/path/to/webstats \
   WEBSTATS_TABLE=webstats \
   CLICKHOUSE_URL=http://localhost:8123 \
   cargo run --release
   ```

During execution the service prints progress as it discovers files, skips any
that fail to parse, and finally reports how many aggregated rows were inserted
into ClickHouse.
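
The control flow described above can be sketched roughly as follows.  This is
an illustration, not the code in `src/main.rs`: `run_once` and `RunSummary` are
hypothetical names, the `ingest` closure stands in for decompression, parsing,
and the ClickHouse insert, and the sketch assumes that a file which fails to
parse is not recorded in the state, so a later run would retry it.

```rust
use std::collections::BTreeSet;

/// Outcome of one run (hypothetical shape).
#[derive(Debug, Default, PartialEq)]
struct RunSummary {
    processed: Vec<String>,
    skipped: Vec<String>,
    rows: usize,
}

/// Ingests every file not yet present in `seen`.
/// `ingest` returns the number of aggregated rows written for one file.
fn run_once<F>(files: &[&str], seen: &mut BTreeSet<String>, mut ingest: F) -> RunSummary
where
    F: FnMut(&str) -> Result<usize, String>,
{
    let mut summary = RunSummary::default();
    for &name in files {
        if seen.contains(name) {
            continue; // already ingested in an earlier run
        }
        match ingest(name) {
            Ok(rows) => {
                summary.rows += rows;
                summary.processed.push(name.to_string());
                seen.insert(name.to_string());
            }
            Err(err) => {
                // Assumption: failures are reported and left out of the state.
                eprintln!("skipping {name}: {err}");
                summary.skipped.push(name.to_string());
            }
        }
    }
    summary
}

fn main() {
    let mut seen = BTreeSet::from(["old.xz".to_string()]);
    let files = ["old.xz", "new.xz", "broken.xz"];
    let summary = run_once(&files, &mut seen, |name| {
        if name == "broken.xz" {
            Err("not valid xz data".to_string())
        } else {
            Ok(42)
        }
    });
    println!(
        "processed {:?}, skipped {:?}, inserted {} rows",
        summary.processed, summary.skipped, summary.rows
    );
}
```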

## Development

Most of the logic lives in `src/weblogs.rs`, which exposes helpers for parsing
log files.  `src/main.rs` wires those helpers into the ClickHouse writer and the
state tracker defined in `src/state.rs`.
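
For orientation, the aggregation step might look like the sketch below.  It is
not the contents of `src/weblogs.rs`: `parse_line` and `count_hits` are
hypothetical helpers, the line layout is assumed to be Apache "combined" style
(`... "GET /path HTTP/1.1" 200 ...`), and plain string splitting is used here
instead of the `regex` crate so the example has no dependencies.

```rust
use std::collections::HashMap;

/// Extracts `(path, status)` from one access-log line.
/// Assumes the request is the first double-quoted field and the status code
/// is the first token after it.
fn parse_line(line: &str) -> Option<(String, u16)> {
    let (_, rest) = line.split_once('"')?;
    let (request, tail) = rest.split_once('"')?;
    // The request looks like: METHOD PATH PROTOCOL
    let path = request.split_whitespace().nth(1)?;
    let status = tail.split_whitespace().next()?.parse::<u16>().ok()?;
    Some((path.to_string(), status))
}

/// Counts hits per unique `(path, status)` pair, ignoring unparsable lines.
fn count_hits<'a>(lines: impl IntoIterator<Item = &'a str>) -> HashMap<(String, u16), u64> {
    let mut hits = HashMap::new();
    for line in lines {
        if let Some(key) = parse_line(line) {
            *hits.entry(key).or_insert(0) += 1;
        }
    }
    hits
}

fn main() {
    let lines = [
        r#"0.0.0.0 - - [01/Jan/2025:00:00:00 +0000] "GET /index.html HTTP/1.1" 200 1234 "-" "-" -"#,
        r#"0.0.0.0 - - [01/Jan/2025:00:00:00 +0000] "GET /index.html HTTP/1.1" 200 1234 "-" "-" -"#,
        r#"0.0.0.0 - - [01/Jan/2025:00:00:00 +0000] "GET /missing HTTP/1.1" 404 0 "-" "-" -"#,
    ];
    for ((path, status), count) in count_hits(lines) {
        println!("{path} {status} {count}");
    }
}
```

Each map entry corresponds to one row of the suggested table, with the count
going into `hits`.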

Useful commands:

```bash
cargo fmt
cargo clippy
cargo test
```
+24 −0
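#!/usr/bin/env bash
# Fetch any webstats archives that are not yet present in the log directory.
set -euo pipefail

BASE_URL="https://collector.torproject.org/recent/webstats"
DEST_DIR=${WEBSTATS_DIR:-"./data/store/webstats"}

mkdir -p "$DEST_DIR"
# Stage downloads in a temporary directory that is removed on exit.
TMP_DIR=$(mktemp -d)
trap 'rm -rf "$TMP_DIR"' EXIT

curl -fsSL "$BASE_URL/" > "$TMP_DIR/index.html"

# Extract every href ending in .xz from the index and fetch the missing ones.
# Note: with pipefail, the script exits non-zero if the index lists no .xz links.
grep -o 'href="[^"]*\.xz"' "$TMP_DIR/index.html" \
  | sed -E 's/href="([^"]+)"/\1/' \
  | while read -r fname; do
      if [ -e "$DEST_DIR/$fname" ]; then
          continue
      fi
      echo "Downloading $fname"
      # Download to the staging area first so partial files never reach DEST_DIR.
      curl -fSL "$BASE_URL/$fname" -o "$TMP_DIR/$fname"
      mv "$TMP_DIR/$fname" "$DEST_DIR/$fname"
    done

echo "Done downloading new archives into $DEST_DIR"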