Improve guard-spec behavior when our primary guards are down and we want to make many circuits

Right now in guard-spec, once you've decided all your primary guards are down and you start trying the next guards in your list in order, you try one new guard for each circuit that you try to open.

The idea, I think, is to end up trying several new guards in parallel, in case the first one is slow to tell you it doesn't work -- but you still only actually use a guard for a real circuit once the earlier guards in the list have all definitely failed.

There are two suboptimal things about this approach:

1. We're potentially touching a whole lot more guards than we need to. For example, imagine we've gone offline and managed to mark our primary guards down, but then we come back online and we're running ricochet, and we have 100 contacts. We then launch 100 new circuits, which causes us to start connections to the next 100 guards in our list. That's a lot of surface area, impacting both security (many new guards that learn that I'm a Tor user) and network load.

(I spent a while thinking about: what do we do when we run out of guards in the filtered list? If we have n circuits where n is enough to exhaust the guards we are willing to try, what happens to the next circuit attempt? But I think it is ok, in the sense that we need to be able to retry circuits anyway to switch back to an earlier guard once it works, so we should be able to handle future circuits failing and needing to wait for an earlier guard to become a success. I don't have a good handle on whether the current code handles this case correctly, but I think there's nothing wrong with the design.)

2. Why should the number of new guards that we try in parallel be a function of the number of circuits we're hoping to build? If it's a good idea to try several in parallel in case the first one is slow to fail, then shouldn't we do that even if there's only one circuit waiting? And from the other side, if we have ten circuits waiting, why should that map to testing ten new guards, when it is super unlikely that we're going to end up using that tenth guard?
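
As a rough illustration of both points, here is a minimal Python sketch of the per-circuit behavior described above. The function name and the guard labels are made up for the example and are not taken from the tor code or from guard-spec; the sketch only models "one new guard per circuit attempt, in list order".

```python
from typing import List


def guards_probed_per_circuit(fallback_guards: List[str],
                              pending_circuits: int) -> List[str]:
    """Hypothetical model: every circuit attempt takes the next untried
    guard from the ordered (non-primary) guard list."""
    probed: List[str] = []
    for _ in range(pending_circuits):
        if len(probed) == len(fallback_guards):
            # List exhausted: remaining circuits have no new guard to try
            # and would have to wait for an earlier guard to succeed.
            break
        probed.append(fallback_guards[len(probed)])
    return probed


# Made-up guard list, purely for illustration.
fallback = [f"guard{i}" for i in range(1, 151)]

print(len(guards_probed_per_circuit(fallback, 100)))  # 100
print(len(guards_probed_per_circuit(fallback, 1)))    # 1
print(len(guards_probed_per_circuit(fallback, 200)))  # 150
```

The number of guards contacted tracks the number of pending circuits: 100 circuits touch 100 guards, while a single circuit gets no parallel attempts at all.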

These two concerns are motivated by reading Nick's new proposal for a slight variation in guard-spec:
https://lists.torproject.org/pipermail/tor-dev/2021-October/014659.html
which I think inherits the same issues.

Here is a concrete alternative design: if our primary guards are down, and we don't yet have a guard that we know we want to use, and there is at least one circuit pending hoping to have a guard, then try to always have three new guard attempts in-flight. This way we are getting the parallel attempt feature, and we get it even if we don't have multiple circuits waiting; and also we are limiting our surface area, and focusing our guard attempts on the ones most likely to actually be used.
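
To make the rule concrete, here is a minimal Python sketch of how the "keep three attempts in flight" decision could look. All names here (`MAX_IN_FLIGHT`, `new_attempts_to_launch`, and its parameters) are hypothetical and not from the tor code or guard-spec; the sketch also assumes we never launch more attempts than there are untried guards left.

```python
MAX_IN_FLIGHT = 3  # "three new guard attempts in-flight", per the design above


def new_attempts_to_launch(primary_guards_down: bool,
                           have_usable_guard: bool,
                           pending_circuits: int,
                           in_flight: int,
                           untried_guards: int) -> int:
    """Hypothetical helper: how many new guard attempts to start right now."""
    # Only probe new guards when the primaries are down, nothing usable has
    # been found yet, and at least one circuit is waiting for a guard.
    if not primary_guards_down or have_usable_guard or pending_circuits < 1:
        return 0
    # Top up to the fixed cap, regardless of how many circuits are waiting.
    return max(0, min(MAX_IN_FLIGHT - in_flight, untried_guards))


# 100 waiting circuits and 1 waiting circuit both get 3 parallel attempts.
print(new_attempts_to_launch(True, False, 100, 0, 150))  # 3
print(new_attempts_to_launch(True, False, 1, 0, 150))    # 3
# One attempt failed, two still in flight: launch one replacement.
print(new_attempts_to_launch(True, False, 1, 2, 150))    # 1
# No circuit is waiting, so no new guards are touched.
print(new_attempts_to_launch(True, False, 0, 0, 150))    # 0
```

The point of the sketch is that the number of pending circuits only gates whether we probe at all; it no longer determines how many guards we contact.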

Edited Oct 22, 2021 by Roger Dingledine
