Create a tool to detect issues in bandwidth files based on their key/value pairs

Bandwidth files contain structured key/value data describing relays and their measured bandwidth over time. However, there is currently no standalone tool that systematically validates these files for logical inconsistencies or suspicious values.

It would be useful to create a tool that parses bandwidth files and reports potential issues based on the contained key/value pairs.

Example checks:

A relay has been seen in fewer consensuses than expected given other fields.
Inconsistencies between “known” and “measured” metrics.
Issues similar to those previously tracked in legacy/trac#29954 (moved) and related child tickets.
Potential incorporation of logic from legacy/trac#30735 (moved).

Why this matters

Bandwidth files influence how relays are weighted in the Tor network. If inconsistencies or anomalies go undetected:

Relay weights may be inaccurate.
Measurement data may silently degrade.
Bugs in the measurement pipeline may go unnoticed.
Historical regressions may reappear without warning.

Having a validation tool improves:

Debuggability
Data quality assurance
CI integration for measurement pipelines
Transparency for operators and developers

Goals

Build a standalone tool (CLI preferred) that:

Parses bandwidth files (can leverage existing services, such as descriptorParser).
Extracts relay-level key/value entries.
Runs a series of validation rules.
Outputs a structured report of detected issues.
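
The four steps above could be sketched roughly as follows. This is only an illustration, not a proposed design: it assumes relay lines are space-separated key=value pairs with the fingerprint stored under node_id, and the names Issue, parse_relay_line, rule_has_bw, run_rules and to_json are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Dict, Iterable, List

Relay = Dict[str, str]


@dataclass
class Issue:
    fingerprint: str
    rule: str
    severity: str  # "error" or "warning"
    message: str


Rule = Callable[[Relay], List[Issue]]


def parse_relay_line(line: str) -> Relay:
    """Split one relay line into a dict (assumes space-separated key=value pairs)."""
    relay: Relay = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # tokens without '=' are ignored in this sketch
            relay[key] = value
    return relay


def rule_has_bw(relay: Relay) -> List[Issue]:
    """Example rule: every relay entry should carry a bw key."""
    if "bw" in relay:
        return []
    fp = relay.get("node_id", "<unknown>")
    return [Issue(fp, "missing-key", "error", f"Relay {fp} missing required key: bw")]


def run_rules(relays: Iterable[Relay], rules: Iterable[Rule]) -> List[Issue]:
    """Apply every rule to every relay and collect the issues."""
    rules = list(rules)
    return [issue for relay in relays for rule in rules for issue in rule(relay)]


def to_json(issues: List[Issue]) -> str:
    """Serialize the issues as a JSON array, one object per issue."""
    return json.dumps([asdict(issue) for issue in issues], indent=2)
```

Keeping each rule as a plain function that returns a list of issues would make it easy to add the legacy checks later without touching the parser.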

The tool should be usable:

Manually (CLI invocation),
In CI pipelines,
For regression detection.
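
For the CI use case, one option is a command-line entry point whose exit status reflects whether any issue was found. The sketch below assumes the header ends at a line made only of "=" characters and that relay lines follow it; check_file and main are hypothetical names, and the single check shown stands in for the full rule set.

```python
import argparse
from typing import List, Optional, Tuple


def split_header_and_relays(lines: List[str]) -> Tuple[List[str], List[str]]:
    """Split on the first line consisting only of '=' (assumed header terminator)."""
    for index, line in enumerate(lines):
        stripped = line.strip()
        if len(stripped) >= 4 and set(stripped) == {"="}:
            relay_lines = [l for l in lines[index + 1:] if l.strip()]
            return lines[:index], relay_lines
    return lines, []  # no terminator found: treat everything as header


def check_file(path: str) -> List[str]:
    """Return one message per problem found in the file at `path`."""
    with open(path, encoding="utf-8") as handle:
        lines = handle.read().splitlines()
    _, relay_lines = split_header_and_relays(lines)
    problems = []
    for number, line in enumerate(relay_lines, start=1):
        keys = {token.partition("=")[0] for token in line.split()}
        if "bw" not in keys:
            problems.append(f"relay line {number}: missing required key: bw")
    return problems


def main(argv: Optional[List[str]] = None) -> int:
    """Print problems and return 1 if any were found, else 0."""
    parser = argparse.ArgumentParser(description="Check a bandwidth file (sketch).")
    parser.add_argument("path", help="path to the bandwidth file")
    args = parser.parse_args(argv)
    problems = check_file(args.path)
    for problem in problems:
        print(problem)
    return 1 if problems else 0
```

A script wrapper would call sys.exit(main()), so that a CI job fails whenever the tool reports a problem.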

Example Validation Rules

Consensus Count Consistency

If consensus_count (or a similar field) indicates that a relay was known for N consensuses, but other fields suggest fewer observations, report:

Relay <fingerprint> seen in fewer consensuses than expected.
Expected: X
Observed: Y
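
A minimal sketch of this rule, producing the message format above. The key names are placeholders: consensus_count comes from the description, while observed_count is a hypothetical stand-in for whichever field carries the observed number; both are parameters so they can be swapped for the real keys.

```python
from typing import Dict, Optional


def check_consensus_count(
    relay: Dict[str, str],
    expected_key: str = "consensus_count",  # placeholder name from the ticket
    observed_key: str = "observed_count",   # hypothetical key, adjust as needed
) -> Optional[str]:
    """Return a report string if observed < expected, otherwise None."""
    fingerprint = relay.get("node_id", "<unknown>")
    try:
        expected = int(relay[expected_key])
        observed = int(relay[observed_key])
    except (KeyError, ValueError):
        # Missing or malformed counts are left to other rules in this sketch.
        return None
    if observed < expected:
        return (
            f"Relay {fingerprint} seen in fewer consensuses than expected.\n"
            f"Expected: {expected}\n"
            f"Observed: {observed}"
        )
    return None
```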

Missing Required Keys

If a required key is absent from a relay entry, flag it as an error or a warning.

Example:

Relay <fingerprint> missing required key: bw
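
This check could look something like the sketch below. The set of required keys (node_id and bw) is an assumption made for illustration and should be taken from the bandwidth file specification; check_required_keys is a hypothetical name.

```python
from typing import Dict, Iterable, List

# Assumed required keys; the authoritative list belongs to the specification.
REQUIRED_KEYS = ("node_id", "bw")


def check_required_keys(
    relay: Dict[str, str],
    required: Iterable[str] = REQUIRED_KEYS,
) -> List[str]:
    """Return one message per required key that is absent from the relay entry."""
    fingerprint = relay.get("node_id", "<unknown>")
    return [
        f"Relay {fingerprint} missing required key: {key}"
        for key in required
        if key not in relay
    ]
```

For a relay entry that has node_id but no bw, this returns a single message in the format shown above.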

Invalid Value Ranges

Examples:

Negative bandwidth values.
Timestamps in the future.
Zero values where not allowed.
Percentiles outside valid ranges.
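
The range checks above might be sketched as follows. Several details are assumptions: bw is treated as an integer that must be positive, the time key is assumed to hold a UTC timestamp formatted as YYYY-MM-DDTHH:MM:SS, and the percentile keys are passed in by the caller because no specific percentile field is named here.

```python
from datetime import datetime
from typing import Dict, Iterable, List

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"  # assumed timestamp format, naive UTC


def check_value_ranges(
    relay: Dict[str, str],
    now: datetime,
    percentile_keys: Iterable[str] = (),
) -> List[str]:
    """Return messages for out-of-range values in one relay entry."""
    fp = relay.get("node_id", "<unknown>")
    issues: List[str] = []

    if "bw" in relay:
        try:
            bw = int(relay["bw"])
        except ValueError:
            issues.append(f"Relay {fp} has non-integer bw: {relay['bw']!r}")
        else:
            if bw < 0:
                issues.append(f"Relay {fp} has negative bw: {bw}")
            elif bw == 0:
                issues.append(f"Relay {fp} has zero bw")

    if "time" in relay:
        try:
            when = datetime.strptime(relay["time"], TIME_FORMAT)
        except ValueError:
            issues.append(f"Relay {fp} has malformed time: {relay['time']!r}")
        else:
            if when > now:
                issues.append(f"Relay {fp} has time in the future: {relay['time']}")

    for key in percentile_keys:
        if key not in relay:
            continue
        try:
            value = float(relay[key])
        except ValueError:
            issues.append(f"Relay {fp} has non-numeric {key}: {relay[key]!r}")
        else:
            if not 0.0 <= value <= 100.0:
                issues.append(f"Relay {fp} has {key} outside 0..100: {relay[key]}")
    return issues
```

Passing now as an argument (for example datetime.now(timezone.utc).replace(tzinfo=None)) keeps the check deterministic when it is run against archived files or in tests.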

Cross-Field Logical Inconsistencies

Examples:

measured_at timestamp older than published_at.
Relay marked as measured but missing measurement result.
Relay marked “unmeasured” but has non-zero bandwidth.
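
A rough sketch of these three cross-field checks follows. It rests on assumptions that would need confirming against real files: measured_at and published_at are taken to be timestamps formatted as YYYY-MM-DDTHH:MM:SS, a relay is treated as unmeasured when it carries unmeasured=1, and the measurement result is represented by a key chosen by the caller (bw_mean is used as a placeholder default).

```python
from datetime import datetime
from typing import Dict, List, Optional

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"  # assumed timestamp format


def _parse_time(value: Optional[str]) -> Optional[datetime]:
    """Parse a timestamp, returning None if it is absent or malformed."""
    if value is None:
        return None
    try:
        return datetime.strptime(value, TIME_FORMAT)
    except ValueError:
        return None


def check_cross_fields(relay: Dict[str, str], result_key: str = "bw_mean") -> List[str]:
    """Return messages for logical inconsistencies between fields of one relay."""
    fp = relay.get("node_id", "<unknown>")
    issues: List[str] = []

    measured_at = _parse_time(relay.get("measured_at"))
    published_at = _parse_time(relay.get("published_at"))
    if measured_at is not None and published_at is not None:
        if measured_at < published_at:
            issues.append(f"Relay {fp} has measured_at older than published_at")

    # Assumption: unmeasured=1 marks an unmeasured relay; anything else is measured.
    unmeasured = relay.get("unmeasured") == "1"
    if not unmeasured and result_key not in relay:
        issues.append(f"Relay {fp} marked as measured but missing {result_key}")

    if unmeasured:
        try:
            bw = int(relay.get("bw", "0"))
        except ValueError:
            bw = 0  # malformed bw is left to the range checks
        if bw != 0:
            issues.append(f"Relay {fp} marked unmeasured but has non-zero bw: {bw}")
    return issues
```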

Historical/Legacy Checks

Reimplement or integrate checks previously discussed in:

legacy/trac#29954 (moved) and child tickets
legacy/trac#30735 (moved)

The goal is to prevent known classes of data issues from resurfacing.

Edited Feb 24, 2026 by Hiro