Bad prop271 behavior when exhausting all guards
Here is a weird prop271 behavior. If you get Tor to exhaust all of its primary/confirmed/sampled guards, it will just get stuck until a guard gets marked for retry (which can take half an hour or more).
Specifically, if you disable the network until Tor starts hitting:
  if (guard == NULL) {
    log_warn(LD_GUARD, "Absolutely no sampled guards were available.");
    return NULL;
  }
it will just get stuck in an "Absolutely no sampled guards were available." loop until a guard gets marked for retry.
In our pre-prop271 guard algorithm, we used to mark all guards as retriable once we had exhausted all of them. I think this is strictly better behavior than just waiting until a guard retry timeout triggers.
Furthermore, in our pre-prop271 guard algorithm, when we exhausted all of our guards, we marked our network as likely-down. The idea was that if our network was marked as likely-down and we then managed to connect to a guard, we would treat that as a network-up event and start trying guards from the top of our list again.
This was a pretty effective heuristic that spared people with unstable internet a lot of guard exposure. We have a similar one for prop271, but it's based only on time (get_internet_likely_down_interval() etc.) and not on behavior. I think bringing this old heuristic to prop271 might be a great idea. Here is how it could work:
When we exhaust all our guards, we mark all guards as retriable. The next time we manage to connect to a guard, we stall the circuit and we call mark_primary_guards_maybe_reachable() so that we attempt to connect to our primaries again, before using that low-priority guard.
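To make the proposal concrete, here is a rough toy model of it in C. This is not Tor code: the toy_* names, the fixed-size guard array and the two-primary layout are all made up for illustration, and the only link to the real code is that the loop in toy_note_guard_succeeded() plays the role that mark_primary_guards_maybe_reachable() would play in the actual implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_N_GUARDS 4   /* hypothetical sample size */
#define TOY_N_PRIMARY 2  /* guards 0..1 are the "primary" guards */

/* Toy guard state; index 0 is the highest-priority guard. */
typedef struct {
  bool maybe_reachable[TOY_N_GUARDS]; /* false = failed, awaiting retry */
  bool network_likely_down;           /* set once every guard has failed */
} toy_guard_state_t;

static void
toy_guard_state_init(toy_guard_state_t *st)
{
  for (int i = 0; i < TOY_N_GUARDS; ++i)
    st->maybe_reachable[i] = true;
  st->network_likely_down = false;
}

/* Return the highest-priority guard that is not marked as failed,
 * or -1 if there is none. */
static int
toy_pick_guard(const toy_guard_state_t *st)
{
  for (int i = 0; i < TOY_N_GUARDS; ++i) {
    if (st->maybe_reachable[i])
      return i;
  }
  return -1;
}

/* Record a failed connection. If that exhausts every guard, do not wait
 * for retry timers: mark all guards retriable and flag the network as
 * likely down. */
static void
toy_note_guard_failed(toy_guard_state_t *st, int idx)
{
  assert(idx >= 0 && idx < TOY_N_GUARDS);
  st->maybe_reachable[idx] = false;
  if (toy_pick_guard(st) < 0) {
    for (int i = 0; i < TOY_N_GUARDS; ++i)
      st->maybe_reachable[i] = true;
    st->network_likely_down = true;
  }
}

/* Record a successful connection. Returns true if the circuit can be used
 * right away, false if it should be stalled while primaries are retried. */
static bool
toy_note_guard_succeeded(toy_guard_state_t *st, int idx)
{
  assert(idx >= 0 && idx < TOY_N_GUARDS);
  if (!st->network_likely_down)
    return true;
  st->network_likely_down = false; /* treat this as a network-up event */
  if (idx < TOY_N_PRIMARY)
    return true;                   /* already on a primary guard */
  /* Stand-in for mark_primary_guards_maybe_reachable(). */
  for (int i = 0; i < TOY_N_PRIMARY; ++i)
    st->maybe_reachable[i] = true;
  return false;
}
```

In this model, if all four guards fail, the next pick is guard 0 again instead of nothing. If guards 0 to 2 then fail once more and guard 3 connects, toy_note_guard_succeeded() returns false (stall the circuit) and the next pick is guard 0, so the primaries get another chance before we settle on the low-priority guard.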