We lowered the number of directory guards to 2, in part because I suspected the higher count was causing extra guard connections to be made and kept open, leading to fingerprinting: tpo/network-health/team#325 (closed)
Some tinkering shows that this can happen at startup if a guard connection is delayed due to networking issues. When it happened, I ended up with a guard with confirmed_idx=20 (very low priority) as a third guard once connectivity resumed.
This third guard connection then stayed around, because Tor would still happily use it for more circuits, instead of avoiding it for all future use, once we had enough connected guards with higher priority.
Worse, even when I killed the connection, Tor immediately respawned two more connection attempts to this guard, even though it already had 2 other guards open at that point. The most bizarre thing to me is that it kept trying for a while.
I am wondering if we can also add more checks so that we do not launch connections to more guards when we already have enough connected. I am guessing that such checks exist, but either some other part of the maze is overriding them, or they are checking at the wrong point in circuit construction/retry.
It's not that simple, but good guess. It appears to be conflux related.
Conflux applies Guard restrictions so that the same guard cannot end up in both legs of the conflux set.
However, it applies these restrictions to exclude guards before the number of primary guards is considered (in select_primary_guard_for_circuit()). Guards that are not excluded are added to the list of usable guards, and the loop stops only once that list reaches the primary guard limit, so an excluded guard does not count toward the limit.
So for each conflux set with 1 leg, it will build a list of two more "primary guards" to choose from, for the second leg. Thus, a third guard can be used for the second leg of some conflux sets.
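To make the ordering problem concrete, here is a minimal sketch in Python. It is a simplified model of the behavior described above, not Tor's actual C code: the function name usable_guards, the guard labels G1..G4, and the limit of 2 are all made up for illustration.

```python
def usable_guards(guards_by_priority, restricted, limit):
    """Hypothetical model: collect up to `limit` guards in priority order.

    Restricted guards are skipped BEFORE the limit is checked, so they
    never count toward it. This mirrors the ordering described above.
    """
    usable = []
    for guard in guards_by_priority:
        if guard in restricted:
            continue  # excluded without using up a slot
        usable.append(guard)
        if len(usable) >= limit:
            break
    return usable


# Illustrative guard list, highest priority first.
guards = ["G1", "G2", "G3", "G4"]

# First leg: no restriction, so only the top two guards are candidates.
first_leg_choices = usable_guards(guards, set(), 2)

# Second leg: G1 already carries the first leg, so conflux restricts it.
# The loop skips G1 and keeps going until it has two guards, reaching G3.
second_leg_choices = usable_guards(guards, {"G1"}, 2)

print(first_leg_choices, second_leg_choices)
```

In this model the second leg can land on G3 even though the limit is 2, which matches the third guard seen in practice.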
Ugh, this one is going to take an annoying amount of refactoring to fix. We were using the restrictions so that we did not have to redo all the guard filtering logic, but it appears we can't use them, for this reason...
Unless we set some kind of flag on guard restrictions to make them "temporary" or something, and set a bool in-param if a temporary restriction is hit, so that the list counter does not count temporarily-excluded primary guards.
This is fragile if the list can somehow become empty because of temporary restrictions, so we'll have to make sure we always end up with at least one guard before exiting the loop.
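A rough sketch of the "temporary restriction" idea, again as a simplified Python model rather than Tor's C code. The function name, the slots_seen counter, and the guard labels are hypothetical. The assumption here is that a temporarily restricted guard still uses up a slot toward the limit, and that the loop only stops once at least one usable guard has been found.

```python
def usable_guards_with_temp_restrictions(guards_by_priority, temp_restricted, limit):
    """Hypothetical model of the proposed fix.

    Temporarily restricted guards are left out of the result but still
    count toward `limit`, so they do not pull lower-priority guards in.
    The loop keeps going past the limit only if nothing usable was found.
    """
    usable = []
    slots_seen = 0
    for guard in guards_by_priority:
        slots_seen += 1  # every guard examined uses a slot, restricted or not
        if guard not in temp_restricted:
            usable.append(guard)
        if slots_seen >= limit and usable:
            break  # limit reached and we have at least one guard
    return usable


guards = ["G1", "G2", "G3", "G4"]

# G1 is temporarily restricted: only G2 remains, and G3 is never reached.
one_restricted = usable_guards_with_temp_restrictions(guards, {"G1"}, 2)

# Both top guards restricted: fall through to G3 rather than return nothing.
both_restricted = usable_guards_with_temp_restrictions(guards, {"G1", "G2"}, 2)

print(one_restricted, both_restricted)
```

The second call shows the fragile case: without the "and usable" condition, the loop would exit with an empty list once both top guards were restricted.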
I realized a third guard can also be used if the user's Guard is also an Exit and it is chosen as the Exit for a circuit. It is then excluded as a guard for that circuit, so the list of primary guards can again end up with a third node.
There are also similar issues with the confirmed list, secondary list, and filtered list.
I think the startup problem I noticed in #40876 (comment 2957143) is also separate. But I believe with this fix, it might stop using those extra startup guards now? I will have to dig to make sure, though.
OK, I am just going to focus on the immediate issue with this bug, rather than dig into the infinite rabbit hole of ways Tor can sometimes end up using different guards. I added extra logs to help diagnose the other cases later.