Keep irl's dynamic bridges around for a few days after rotation
Right now, as I understand it, the moment we get a "blocked!" result out of bridgestatus in any country, we tell irl's dynamic bridge farm that it's time to rotate that bridge.
This is good from an overall reachability perspective, because if we're right and the bridge did get blocked, any waiting is bad for users.
But also, by spinning the old bridge down right then, we lose out on a lot of potential understanding:
- How often are these 'blocked' events false positives? That is, how often do we rotate away from a bridge when it was actually just a brief connectivity issue or a slow bootstrap? We label these cases "blocked" in our data, yet we don't know how much we're overcounting.
- By rotating the bridge immediately, we make it hard to figure out what happened in other countries. The tpo/anti-censorship/team#92 (closed) analysis shows what look like a bunch of false positives in Turkey and Russia: we get one connection failure from Turkey while the bridge still works from other countries, and then the next day the bridge is down, because we intentionally took it down after that first 'blocked' case.
This approach of taking the bridge down after the first failure prevents us from using a more nuanced definition of blocked such as "down three days in a row in this country while still reachable from some other countries during that time".
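To make that more nuanced definition concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `is_blocked` name, the `results` layout (a dict keyed by day and country), and the reading of "during that time" as "on each of those days". It is not how bridgestatus stores or evaluates its data.

```python
from datetime import date, timedelta

def is_blocked(results, country, last_day, streak=3):
    """Hypothetical check for the stricter 'blocked' definition.

    results maps (day, country) -> True if the bridge was reachable,
    False if it was not. Missing entries mean "no measurement".
    """
    for offset in range(streak):
        day = last_day - timedelta(days=offset)
        # Need an explicit failure from this country on every day of the streak.
        if results.get((day, country)) is not False:
            return False
        # Need at least one other country that could still reach it that day;
        # otherwise the bridge was probably just down, not blocked.
        others_ok = any(
            ok for (d, c), ok in results.items() if d == day and c != country
        )
        if not others_ok:
            return False
    return True
```

With this kind of rule, a single failure from Turkey on a day when other countries still connect would not count as blocked, but we can only evaluate it if the old bridge stays up long enough to collect the later measurements.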
So my proposed fix is to change irl's dynamic bridge framework to schedule a spin-down of the old bridge for N (e.g. N=3) days in the future. We can still spin up the new one right then, pass the bridge lines back, etc just as we do now.
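Roughly, the change could look like the sketch below. I don't know how irl's framework is actually structured, so `Bridge`, `rotate`, and `due_for_spin_down` are hypothetical names; the point is only that rotation records a future spin-down time instead of destroying the old bridge on the spot.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass
class Bridge:
    name: str
    # None means the bridge is live with no spin-down scheduled.
    spin_down_at: Optional[datetime] = None

def rotate(old, new_name, now, grace_days=3):
    """Schedule the old bridge's spin-down and return its replacement."""
    old.spin_down_at = now + timedelta(days=grace_days)
    # The new bridge comes up immediately, same as today.
    return Bridge(new_name)

def due_for_spin_down(bridges, now):
    """Bridges whose grace period has ended and can be torn down."""
    return [
        b for b in bridges
        if b.spin_down_at is not None and b.spin_down_at <= now
    ]
```

A periodic job would then call something like `due_for_spin_down` and tear down whatever it returns, while bridgestatus keeps testing the old bridge during the grace period.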
This way we get a view into the future -- of how that bridge works from each of our countries after the initial potential blocking event.
One downside is that irl might be running more total VMs than he originally expected during the overlap period. This one seems solvable though.