Use SQLite rather than CSV for intermediate data representation

The bridge-extra-info and snowflake-stats programs output multiple CSV files for each input descriptor tarball. Each of these CSV files is one rectangular table, with the schema of the table annotated in the filename. For example, for the input tarball bridge-extra-infos-2022-10.tar.xz, we output the files:

bw_history/bridge-extra-infos-2022-10.bw_history.csv
country_reqs/bridge-extra-infos-2022-10.country_reqs.csv
dir_reqs/bridge-extra-infos-2022-10.dir_reqs.csv
transport_ips/bridge-extra-infos-2022-10.transport_ips.csv

And for snowflakes-2022-10.tar.xz, we output the files:

client_match/snowflakes-2022-10.client_match.csv
client_polls/snowflakes-2022-10.client_polls.csv
proxy_country/snowflakes-2022-10.proxy_country.csv
proxy_nat_type/snowflakes-2022-10.proxy_nat_type.csv
proxy_type/snowflakes-2022-10.proxy_type.csv

Rather than emitting this pile of CSV files per input file, emit just one SQLite database file per input. So for bridge-extra-infos-2022-10.tar.xz we output

bridge-extra-infos/bridge-extra-infos-2022-10.sqlite3

and for snowflakes-2022-10.tar.xz we output

snowflakes/snowflakes-2022-10.sqlite3

The database files contain multiple tables, which correspond to the former CSV files. bridge-extra-infos SQLite files contain tables bw_history, country_reqs, dir_reqs, and transport_ips. snowflakes SQLite files contain tables client_match, client_polls, proxy_country, proxy_nat_type, and proxy_type.
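As a rough illustration of the layout (not the actual converter code), the following sketch builds a bridge-extra-infos-style database with the descriptors, intervals, and transport_ips tables and the same foreign-key relationships. The real schema declares the tables STRICT, which requires SQLite >= 3.37; STRICT is omitted here for portability, and the sample row values are taken from the example output further down.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the REFERENCES constraints

# Same columns as the real schema, minus STRICT (needs SQLite >= 3.37).
conn.execute("""CREATE TABLE descriptors (
    id INTEGER PRIMARY KEY,
    published TEXT,
    fingerprint BLOB,
    nickname TEXT
)""")
conn.execute("""CREATE TABLE intervals (
    id INTEGER PRIMARY KEY,
    type TEXT,
    begin TEXT,
    end TEXT,
    descriptor INTEGER,
    FOREIGN KEY(descriptor) REFERENCES descriptors(id)
)""")
conn.execute("""CREATE TABLE transport_ips (
    transport TEXT,
    ips INTEGER,
    interval INTEGER,
    FOREIGN KEY(interval) REFERENCES intervals(id) ON DELETE CASCADE
)""")

# One descriptor, one interval belonging to it, two transport_ips rows
# belonging to the interval.
cur = conn.execute(
    "INSERT INTO descriptors (published, fingerprint, nickname) VALUES (?, ?, ?)",
    ("2022-10-01 09:36:44",
     bytes.fromhex("5481936581E23D2D178105D44DB6915AB06BFB7F"),
     "flakey2"),
)
desc_id = cur.lastrowid
cur = conn.execute(
    "INSERT INTO intervals (type, begin, end, descriptor) VALUES (?, ?, ?, ?)",
    ("transport_ips", "2022-09-28 14:16:51", "2022-09-29 14:16:51", desc_id),
)
interval_id = cur.lastrowid
conn.executemany(
    "INSERT INTO transport_ips (transport, ips, interval) VALUES (?, ?, ?)",
    [("<OR>", 12, interval_id), ("snowflake", 85020, interval_id)],
)
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM transport_ips").fetchone()[0])  # 2
```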

$ sqlite3 -readonly bridge-extra-infos/bridge-extra-infos-2022-10.sqlite3 ".schema"
CREATE TABLE descriptors (
            id INTEGER PRIMARY KEY,
            published TEXT,
            fingerprint BLOB,
            nickname TEXT
        ) STRICT
    ;
CREATE TABLE intervals (
            id INTEGER PRIMARY KEY,
            type TEXT,
            begin TEXT,
            end TEXT,
            descriptor INTEGER,
            FOREIGN KEY(descriptor) REFERENCES descriptors(id)
        ) STRICT
    ;
CREATE TABLE dir_reqs (
            reqs INTEGER,
            interval INTEGER,
            FOREIGN KEY(interval) REFERENCES intervals(id) ON DELETE CASCADE
        ) STRICT
    ;
CREATE TABLE country_reqs (
            country TEXT,
            reqs REAL,
            interval INTEGER,
            FOREIGN KEY(interval) REFERENCES intervals(id) ON DELETE CASCADE
        ) STRICT
    ;
CREATE TABLE transport_ips (
            transport TEXT,
            ips INTEGER,
            interval INTEGER,
            FOREIGN KEY(interval) REFERENCES intervals(id) ON DELETE CASCADE
        ) STRICT
    ;
CREATE TABLE bw_history (
            type TEXT,
            bytes INTEGER,
            interval INTEGER,
            FOREIGN KEY(interval) REFERENCES intervals(id) ON DELETE CASCADE
        ) STRICT
    ;
CREATE INDEX dir_reqs_interval ON dir_reqs(interval);
CREATE INDEX country_reqs_interval ON country_reqs(interval);
CREATE INDEX transport_ips_interval ON transport_ips(interval);
CREATE INDEX bw_history_interval ON bw_history(interval);
$ sqlite3 -readonly snowflakes/snowflakes-2022-10.sqlite3 ".schema"
CREATE TABLE intervals (
            id INTEGER PRIMARY KEY,
            type TEXT,
            begin TEXT,
            end TEXT
        ) STRICT
    ;
CREATE TABLE proxy_country (
            country TEXT,
            unique_ips INTEGER,
            interval INTEGER,
            FOREIGN KEY(interval) REFERENCES intervals(id) ON DELETE CASCADE
        ) STRICT
    ;
CREATE TABLE proxy_type (
            type TEXT,
            unique_ips INTEGER,
            interval INTEGER,
            FOREIGN KEY(interval) REFERENCES intervals(id) ON DELETE CASCADE
        ) STRICT
    ;
CREATE TABLE proxy_nat_type (
            nat_type TEXT,
            unique_ips INTEGER,
            interval INTEGER,
            FOREIGN KEY(interval) REFERENCES intervals(id) ON DELETE CASCADE
        ) STRICT
    ;
CREATE TABLE client_match (
            restricted_denied_count INTEGER,
            unrestricted_denied_count INTEGER,
            matched_count INTEGER,
            timeout_count INTEGER,
            interval INTEGER,
            FOREIGN KEY(interval) REFERENCES intervals(id) ON DELETE CASCADE
        ) STRICT
    ;
CREATE TABLE client_polls (
            country TEXT,
            rendezvous_method TEXT,
            num_polls INTEGER,
            interval INTEGER,
            FOREIGN KEY(interval) REFERENCES intervals(id) ON DELETE CASCADE
        ) STRICT
    ;
CREATE INDEX proxy_country_interval ON proxy_country(interval);
CREATE INDEX proxy_type_interval ON proxy_type(interval);
CREATE INDEX proxy_nat_type_interval ON proxy_nat_type(interval);
CREATE INDEX client_match_interval ON client_match(interval);
CREATE INDEX client_polls_interval ON client_polls(interval);

The begin/end timestamps of intervals, and (for bridge-extra-info descriptors) the published date, fingerprint, and nickname, are factored out into separate intervals and descriptors tables. Reducing the clutter of many intermediate CSV files is one benefit, but the deeper motivation for this change is that the descriptors and intervals tables can be interpreted independently of the finer-grained information in the other tables. That will enable correct calculation of coverage in cases where the broken-down information represents the intervals only sparsely (#3 (closed)).
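To make the independence concrete, here is a hedged sketch of a coverage query that uses only the intervals table, never the per-country or per-transport tables. The table contents and the `type` value are hypothetical; the point is that total covered time falls out of `begin`/`end` alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# intervals as in the snowflakes schema (STRICT omitted for portability).
conn.execute(
    "CREATE TABLE intervals (id INTEGER PRIMARY KEY, type TEXT, begin TEXT, end TEXT)"
)
# Two hypothetical back-to-back 24-hour intervals of some stats type.
conn.executemany(
    "INSERT INTO intervals (type, begin, end) VALUES (?, ?, ?)",
    [
        ("example-stats", "2022-10-01 00:00:00", "2022-10-02 00:00:00"),
        ("example-stats", "2022-10-02 00:00:00", "2022-10-03 00:00:00"),
    ],
)
# Seconds covered, computed from the intervals table alone:
# strftime('%s', ...) converts each timestamp to Unix epoch seconds.
(covered,) = conn.execute(
    "SELECT SUM(strftime('%s', end) - strftime('%s', begin))"
    " FROM intervals WHERE type = 'example-stats'"
).fetchone()
print(int(covered))  # 172800, i.e. two full days
```

A real coverage calculation would also need to handle overlapping or duplicate intervals, but even that only requires the intervals (and, for bridges, descriptors) tables.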

The former intermediate CSV files are no longer needed, because downstream programs such as userstats-bridge-transport-multi and proxy-type read directly from the SQLite databases. To recover the equivalent of a former intermediate CSV file, join the tables on the interval and descriptor IDs. For example, for a transport_ips CSV:

$ sqlite3 -readonly bridge-extra-infos/bridge-extra-infos-2022-10.sqlite3 \
".mode csv" ".headers on" "\
SELECT published, hex(fingerprint) AS fingerprint, begin, end, nickname, transport, ips \
FROM transport_ips \
LEFT JOIN intervals ON interval = intervals.id \
LEFT JOIN descriptors ON descriptor = descriptors.id \
ORDER BY published\
" | head
published,fingerprint,begin,end,nickname,transport,ips
"2022-10-01 09:36:44",5481936581E23D2D178105D44DB6915AB06BFB7F,"2022-09-28 14:16:51","2022-09-29 14:16:51",flakey2,<OR>,12
"2022-10-01 09:36:44",5481936581E23D2D178105D44DB6915AB06BFB7F,"2022-09-28 14:16:51","2022-09-29 14:16:51",flakey2,snowflake,85020
"2022-10-01 11:18:25",5481936581E23D2D178105D44DB6915AB06BFB7F,"2022-09-30 09:37:19","2022-10-01 09:37:19",flakey5,<OR>,20
"2022-10-01 11:18:25",5481936581E23D2D178105D44DB6915AB06BFB7F,"2022-09-30 09:37:19","2022-10-01 09:37:19",flakey5,snowflake,54516
"2022-10-01 12:48:27",5481936581E23D2D178105D44DB6915AB06BFB7F,"2022-09-30 09:37:20","2022-10-01 09:37:20",flakey10,<OR>,20
"2022-10-01 12:48:27",5481936581E23D2D178105D44DB6915AB06BFB7F,"2022-09-30 09:37:20","2022-10-01 09:37:20",flakey10,snowflake,54284
"2022-10-01 13:28:56",91DA221A149007D0FD9E5515F5786C3DD07E4BB0,"2022-09-29 19:28:56","2022-09-30 19:28:56",crusty4,snowflake,4
"2022-10-01 17:28:56",91DA221A149007D0FD9E5515F5786C3DD07E4BB0,"2022-09-29 19:28:55","2022-09-30 19:28:55",crusty1,snowflake,4
"2022-10-02 05:24:57",5481936581E23D2D178105D44DB6915AB06BFB7F,"2022-09-30 09:36:47","2022-10-01 09:36:47",flakey11,<OR>,4
