arti-based obfs4 quick reachability monitor
In the "Better bridgestrap designs to quickly detect bridges going offline" section of tpo/anti-censorship/censorship-analysis#40035, I describe a tool we need that will notice when bridges go down / get blocked. One goal is to notice quickly enough that we can use the reachability result in the "bridge subscription" approach (tpo/anti-censorship/team#42) where clients auto-ask for a replacement when one of their bridges goes down.
- Part one, a feature replacement for bridgestrap. That is, we need to take in a set of bridge lines to test (e.g. via an http post from rdsys, but we can do that interface however both sides like; one possible shape is sketched after this list), launch connections to them, learn success or failure, and report it both in response to the request from rdsys and also write it out to a file like the ones here: https://collector.torproject.org/recent/bridgestrap/
- Part two, the new feature: we want to learn, for each bridge that we think is currently up, the moment that it goes down. The goal is that clients will start coming to rdsys asking for a replacement bridge when they think one of their bridges goes down, and we want rdsys to already have our 'ground truth' answer by the time the client makes that request. The naive approach would be to connect to each bridge every 10 seconds or something using the 'part one' approach from above. But I think the better option is to hold open connections to each bridge, and then notice when a connection breaks (see the hold-open sketch after this list). We will still be subject to TCP issues where sometimes it takes a while to notice a broken connection, but this approach is a low-cost way to get results within a 60-ish second response time range.
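The ticket deliberately leaves the rdsys interface open, but to make part one concrete, here is one hypothetical shape for the request and the per-bridge result, as a Rust sketch. Everything here (struct names, field names, the use of JSON via serde/serde_json) is illustrative, not a spec:

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical body of the HTTP POST from rdsys: just the bridge lines.
#[derive(Deserialize)]
struct TestRequest {
    bridge_lines: Vec<String>,
}

/// Hypothetical per-bridge result, both for the HTTP response to rdsys and
/// for the bridgestrap-style file we write out.
#[derive(Serialize)]
struct TestResult {
    bridge_line: String,
    reachable: bool,
    /// Populated when the test failed, e.g. "connection refused".
    error: Option<String>,
}

fn main() {
    let result = TestResult {
        bridge_line: "obfs4 192.0.2.1:443 ...".into(),
        reachable: true,
        error: None,
    };
    println!("{}", serde_json::to_string(&result).unwrap());
}
```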
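And a minimal sketch of the part-two hold-open loop, at the raw TCP level for brevity; the real monitor would speak obfs4 (and, per the detail list below, would need some trick to keep the bridge from expiring an idle connection). Assumes tokio; the addresses are illustrative:

```rust
use std::time::Duration;
use tokio::io::AsyncReadExt;
use tokio::net::TcpStream;

/// Hold a connection open to one bridge and report the moment it breaks.
async fn watch_bridge(addr: String) {
    loop {
        match TcpStream::connect(addr.as_str()).await {
            Ok(mut stream) => {
                eprintln!("{addr}: connected, holding open");
                let mut buf = [0u8; 512];
                loop {
                    // Ok(0) is EOF and Err is a broken connection; either way
                    // that's our "bridge possibly down" signal. A silently
                    // dropped connection can take a while to surface here,
                    // which is the TCP caveat mentioned above.
                    match stream.read(&mut buf).await {
                        Ok(0) | Err(_) => {
                            eprintln!("{addr}: connection broke, flag as possibly down");
                            break;
                        }
                        Ok(_) => {} // ignore whatever bytes arrive
                    }
                }
            }
            Err(e) => eprintln!("{addr}: connect failed: {e}"),
        }
        // Back off before reconnecting so a dead bridge doesn't
        // become a tight retry loop.
        tokio::time::sleep(Duration::from_secs(30)).await;
    }
}

#[tokio::main]
async fn main() {
    // Hypothetical addresses; in practice these come from the bridge lines.
    for addr in ["192.0.2.1:443", "198.51.100.2:9001"] {
        tokio::spawn(watch_bridge(addr.to_string()));
    }
    // A real tool would run indefinitely and push state changes to rdsys;
    // here we just keep the process alive for an hour.
    tokio::time::sleep(Duration::from_secs(3600)).await;
}
```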
Some details that make 'part two' more complicated than we might first think:
- We still need to launch new connections every so often too, to detect the case where the bridge stays up for existing connections but somehow gets firewalled for new ones. How often to launch those connections is a parameter we should explore, to find a good balance between being thorough and limiting our overall connection volume (see the fresh-connect sketch after this list).
- Can the obfs4proxy tool handle hundreds or thousands of client connections, or does something fail at that scale? I don't know the answer, but if we discover scaling issues, we should either debug and resolve them, or maybe we work around them and use them as motivation to switch to the "arti runs a rust obfs4 thread" model that we already want to get to eventually.
- If we simply make a connection to an obfs4 bridge and then try to hold it open, the bridge will expire and close that connection after a few minutes because it has no circuits on it. So we need to come up with a suitable trick to convince the bridge not to expire the connection (see the keep-alive sketch after this list). Is making a one-hop circuit and leaving it open enough, or does it get expired after a while? How about if you put a dir stream on it? Worst case we can make a two-hop circuit, but it would be unfortunate if that's our best option: if we pick a new second hop each time, we undermine our bridge enumeration defenses (strategy 2 on https://blog.torproject.org/research-problems-ten-ways-discover-tor-bridges/), whereas if we pick one relay to use every time (e.g. Serge), everything fails if that relay goes away for a bit.
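For the fresh-connect cadence in the first detail, a hedged sketch: a jittered periodic probe per bridge, again raw TCP with tokio plus the rand crate, where the 10-minute base interval is just a starting point to explore. For scale intuition, say 3000 bridges probed once every 10 minutes averages 5 fresh connections per second across the whole deployment, whereas every 10 seconds would be 300 per second.

```rust
use rand::Rng;
use std::time::Duration;
use tokio::net::TcpStream;

/// Every so often, open a brand-new connection to catch bridges that stay up
/// for existing connections but are firewalled for new ones. `base` is the
/// cadence parameter we want to tune; random jitter keeps the probes from
/// arriving in synchronized bursts.
async fn fresh_connect_probe(addr: String, base: Duration) {
    loop {
        let jitter = rand::thread_rng().gen_range(0..=base.as_secs() / 4);
        tokio::time::sleep(base + Duration::from_secs(jitter)).await;
        // The stream closes as soon as it drops out of scope; we only care
        // whether the connect itself succeeds.
        match TcpStream::connect(addr.as_str()).await {
            Ok(_) => eprintln!("{addr}: fresh connection ok"),
            Err(e) => eprintln!("{addr}: fresh connection failed: {e}"),
        }
    }
}

#[tokio::main]
async fn main() {
    // Hypothetical address and base cadence.
    fresh_connect_probe("192.0.2.1:443".to_string(), Duration::from_secs(600)).await;
}
```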
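For the expiry problem in the last detail, we can't write real code until we know which trick works, but the one-hop-circuit-plus-dir-stream variant would look roughly like this. To be clear, build_one_hop_circuit and issue_dir_request are hypothetical placeholders, not arti APIs; only the shape of the strategy is real:

```rust
use std::time::Duration;

// NOT real arti APIs: hypothetical stand-ins for whatever arti plumbing we
// end up with, present only so the strategy below type-checks.
struct Circuit;
struct Error;
async fn build_one_hop_circuit(_bridge_line: &str) -> Result<Circuit, Error> {
    unimplemented!()
}
async fn issue_dir_request(_circ: &Circuit) -> Result<(), Error> {
    unimplemented!()
}

/// Keep-alive strategy for one bridge: hold a one-hop circuit open and
/// periodically put a cheap directory request on it, so the bridge never
/// sees the connection as idle and (we hope) never expires it. Whether a
/// one-hop circuit or a dir stream actually prevents expiry is exactly the
/// open question above.
async fn keep_alive(bridge_line: &str) -> Result<(), Error> {
    let circ = build_one_hop_circuit(bridge_line).await?;
    loop {
        // An error from the dir request propagates out via `?`: our signal
        // that the bridge expired us or went down, so the caller can flag it.
        issue_dir_request(&circ).await?;
        tokio::time::sleep(Duration::from_secs(120)).await;
    }
}
```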
The eventual deployment vision is that (a) we run this tool in a safe uncensored location, where the goal is to get ground truth about upness: if it scales well we run one instance to test every bridge, or if needed we run a farm of them where each instance handles its fair share of the bridges (a simple sharding sketch follows); and (b) we run one of these tools inside each censored area, where it does tests on demand, and we only give it an address to test if our uncensored location says it's up but a threshold of clients have come to us saying it's down in their location.
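If the farm in (a) turns out to be needed, one simple way to give each instance its fair share is to hash the bridge fingerprint into an instance index; the assignment is stable as long as the instance count doesn't change. Instance count and fingerprints below are made up:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Map a bridge fingerprint to one of `n_instances` monitor instances.
fn instance_for(fingerprint: &str, n_instances: u64) -> u64 {
    let mut h = DefaultHasher::new();
    fingerprint.hash(&mut h);
    h.finish() % n_instances
}

fn main() {
    // Hypothetical fingerprints; real ones come from the bridge lines.
    for fp in [
        "4C331FA9B3D1D6D8FB0D8FBBF0C5C5A2FE8D04F6",
        "0011223344556677889900112233445566778899",
    ] {
        println!("{fp} -> instance {}", instance_for(fp, 4));
    }
}
```

One caveat on this design: plain modulo reshuffles most bridges whenever the instance count changes, so if that churn matters, something like consistent hashing keeps reassignment proportional to the change.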