Implement statistics gathering for number of Bridges-per-Transport in BridgeDB
As part of the SponsorS PT work, we promised a way to gather statistics on the number of bridges per transport.
The proposal states this is a task for Metrics. However, it's possible to do this on the BridgeDB side. In fact, it would help BridgeDB in the future to determine how to better allocate bridges to its Distributors (and help the Distributors hand them out to users in smarter ways).
Technically, BridgeDB already sort-of has data on the number of Bridges-per-Transport… or, rather, when a client requests a certain type of bridge from a certain Distributor (e.g. "give me an IPv4 obfs3 bridge from the HTTPS Distributor"), BridgeDB creates (or retrieves from a cache) a "filtered" subhashring containing only Bridges which fit the client's request. BridgeDB even logs the number of Bridges in these subhashrings in its DEBUG and INFO logs:
22:19:16 INFO L1361:Bridges.addRing() Bridges inserted into HTTPS-Transpo subring: 235
22:19:16 DEBUG L75:Dist.getNumBridgesPerA() Returning 3 bridges from ring of len: 235
The problem with using those numbers for statistics is that BridgeDB's Distributors may have multiple adjacent subhashrings, usually about 5. So, in the above case, there are roughly 5 * 235 = 1175 obfs3 bridges in the HTTPS Distributor. (These numbers aren't from the real deployed BridgeDB, by the way.)
A better way to do this would be to provide a database query (as part of #12031) which counts the number of Bridges which claim to offer a PT. An example mechanism for doing this in Redis would be to keep a hash (i.e. using HSET or HINCRBY) of Bridges which have any PTs, where the keys are the Bridge fingerprints, add a field for each type of PT, and then (if not using HINCRBY) set the field's value to that PT's addresses in the form IP:PORT[,IP:PORT[,IP:PORT[…]]], for example:
redis> HSET 26F6A7570E0F655DFDD054E79ACBB127112C2D7B obfs4 "184.108.40.206:4444,220.127.116.11:5555"
With that scheme, a new HSET would be necessary each time the @type bridge-extrainfo descriptors are parsed, but this only has time complexity O(1).
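A minimal sketch of the scheme above, in Python. It uses a plain dict as a stand-in for the Redis hashes so that it runs without a server. The function names (hset, bridges_per_transport) and the second bridge are hypothetical, made up for illustration; none of this is existing BridgeDB code.

```python
from collections import Counter

# In-memory stand-in for Redis: {bridge fingerprint: {PT name: "IP:PORT,..."}}.
store = {}

def hset(key, field, value):
    """Mimic Redis HSET for a single field (insert or overwrite)."""
    store.setdefault(key, {})[field] = value

def bridges_per_transport():
    """Count how many bridges have a field for each PT."""
    counts = Counter()
    for fields in store.values():
        counts.update(fields.keys())
    return counts

# The bridge from the HSET example above.
hset("26F6A7570E0F655DFDD054E79ACBB127112C2D7B", "obfs4",
     "184.108.40.206:4444,220.127.116.11:5555")
# Hypothetical second bridge, for illustration only.
hset("0" * 40, "obfs4", "192.0.2.1:443")
hset("0" * 40, "obfs3", "192.0.2.1:8080")

counts = bridges_per_transport()
```

Because HSET overwrites the field, re-parsing the same descriptor does not inflate the count. The trade-off is that counting means walking every key. A per-PT counter kept with HINCRBY would avoid that walk, but it would need extra care not to count a bridge twice when its descriptor is parsed again.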
Some considerations / additional query parameters:
For these statistics, should we only count Bridges with the Running flag? Or only if the OONI machine says the PT is reachable?
What sanitisations should be done on these numbers? Should we round them? Or provide a range, e.g. "between 1000 and 5000 obfs4 bridges"?
Do we want only the Bridges with a given PT? Or do we want the number of instances of a given PT (e.g. if a Bridge has multiple obfs3 instances)?
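Two of the questions above (bridges vs. instances, and rounding vs. ranges) can be made concrete with a short Python sketch. It assumes the per-bridge data is a mapping of fingerprint to {PT name: comma-separated address list}, as in the HSET example. The helper names, the bucket boundaries, the rounding step, and the second bridge are all hypothetical, picked only to show the options.

```python
from bisect import bisect_right
from collections import Counter

def count_pt(bridges, per_instance=False):
    """Count bridges per PT, or PT instances (addresses) if per_instance."""
    counts = Counter()
    for fields in bridges.values():
        for pt, addrs in fields.items():
            counts[pt] += len(addrs.split(",")) if per_instance else 1
    return counts

def round_to(n, step=100):
    """Round a non-negative count to the nearest multiple of step."""
    return ((n + step // 2) // step) * step

# Hypothetical bucket boundaries; the real ones are an open question.
BUCKETS = [0, 10, 100, 1000, 5000]

def to_range(n):
    """Report a count as a coarse range such as '1000-5000'."""
    if n < 0:
        raise ValueError("count must be non-negative")
    i = bisect_right(BUCKETS, n)
    low = BUCKETS[i - 1]
    if i == len(BUCKETS):
        return f"{low}+"
    return f"{low}-{BUCKETS[i]}"

bridges = {
    # Bridge from the HSET example: one bridge, two obfs4 instances.
    "26F6A7570E0F655DFDD054E79ACBB127112C2D7B": {
        "obfs4": "184.108.40.206:4444,220.127.116.11:5555",
    },
    # Hypothetical second bridge.
    "0" * 40: {"obfs4": "192.0.2.1:443"},
}

by_bridge = count_pt(bridges)                        # obfs4: 2
by_instance = count_pt(bridges, per_instance=True)   # obfs4: 3
```

With the illustrative total of 1175 from earlier, round_to(1175) gives 1200 and to_range(1175) gives "1000-5000". Whichever is chosen, the sanitisation would apply to the published number only, not to the stored data.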