entry guard retry schedule and bridge descriptor refetch schedule combine poorly
When we are configured to use bridges, there are two different subsystems that govern when to retry the bridges:
- there's the entry guard system, which calls functions like entry_guard_consider_retry() that look at guard->last_tried_to_connect and decide to give your guard another shot after 10 minutes or so.
- there's the bridge descriptor fetch system, with the core function fetch_bridge_descriptors(), which looks at the download schedule in bridge->fetch_status.
The trouble here is that these two subsystems run on independent schedules, yet each one changes state that the other depends on.
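To make the interaction concrete, here is a toy model of the two timers. It is only a sketch and not Tor's actual code: the Bridge class, the helper names, the exact 600-second retry interval, and the assumption that a failed descriptor fetch also updates last_tried_to_connect are all simplifications made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Reachable(Enum):
    NO = "GUARD_REACHABLE_NO"
    MAYBE = "GUARD_REACHABLE_MAYBE"
    YES = "GUARD_REACHABLE_YES"


# Assumed value: the text above only says "10 minutes or so".
GUARD_RETRY_INTERVAL = 10 * 60


@dataclass
class Bridge:
    """Hypothetical stand-in for the guard and bridge state of one bridge."""
    reachable: Reachable
    last_tried_to_connect: int  # seconds, guard subsystem's clock
    next_fetch_at: int          # seconds, descriptor download schedule
    works: bool


def guard_consider_retry(bridge, now):
    """Toy version of the guard timer: NO becomes MAYBE after the interval."""
    idle = now - bridge.last_tried_to_connect
    if bridge.reachable is Reachable.NO and idle >= GUARD_RETRY_INTERVAL:
        bridge.reachable = Reachable.MAYBE


def fetch_descriptor_if_due(bridge, now, backoff):
    """Toy version of the fetch timer: only a fetch resolves MAYBE."""
    if now < bridge.next_fetch_at:
        return
    bridge.last_tried_to_connect = now  # assumption, see lead-in
    bridge.next_fetch_at = now + backoff
    bridge.reachable = Reachable.YES if bridge.works else Reachable.NO


def seconds_stuck_in_maybe(bridge, horizon, backoff):
    """Step both timers once per second and count the seconds spent in MAYBE."""
    stuck = 0
    for now in range(horizon):
        guard_consider_retry(bridge, now)
        fetch_descriptor_if_due(bridge, now, backoff)
        if bridge.reachable is Reachable.MAYBE:
            stuck += 1
    return stuck


bad_bridge = Bridge(Reachable.NO, last_tried_to_connect=0,
                    next_fetch_at=1080, works=False)
stuck_seconds = seconds_stuck_in_maybe(bad_bridge, horizon=1200, backoff=1361)
```

With these made-up numbers the guard timer fires at 600 seconds, the next fetch is not due until 1080 seconds, and the bridge sits in MAYBE for the 480 seconds in between because nothing links the two schedules.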
When I'm running in the #40396 (closed) scenario with one good bridge and one bad bridge, it is imperative that a guard (bridge) that goes into state GUARD_REACHABLE_MAYBE resolves the MAYBE into a YES or NO quickly. While it stays in MAYBE, if it is one of the primary guards, Tor will refuse to use the network:
Oct 23 06:58:35.115 [info] Our directory information is no longer up-to-date enough to build circuits: We're missing descriptors for 1/2 of our primary entry guards (total microdescriptors: 6379/6422). That's ok. We will try to fetch missing descriptors soon.
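As a rough illustration of why MAYBE is so costly, the check behind that log line can be modelled as below. This is a hedged sketch of the behaviour described here, not the real Tor function: PrimaryGuard, missing_primary_descriptors(), and can_build_circuits() are hypothetical names, and the rule that guards marked GUARD_REACHABLE_NO are skipped is my reading of the behaviour rather than a quote from the source.

```python
from dataclasses import dataclass


@dataclass
class PrimaryGuard:
    """Hypothetical, minimal view of one primary guard (bridge)."""
    name: str
    reachable: str        # "GUARD_REACHABLE_NO", "..._MAYBE" or "..._YES"
    has_descriptor: bool


def missing_primary_descriptors(primaries):
    """Return (missing, considered), skipping guards already marked NO."""
    considered = [g for g in primaries if g.reachable != "GUARD_REACHABLE_NO"]
    missing = [g for g in considered if not g.has_descriptor]
    return len(missing), len(considered)


def can_build_circuits(primaries):
    """Toy rule: any considered primary guard without a descriptor blocks us."""
    missing, _considered = missing_primary_descriptors(primaries)
    return missing == 0


good = PrimaryGuard("good bridge", "GUARD_REACHABLE_YES", True)
bad = PrimaryGuard("bad bridge", "GUARD_REACHABLE_NO", False)

ok_while_no = can_build_circuits([good, bad])       # bad bridge is ignored
bad.reachable = "GUARD_REACHABLE_MAYBE"             # guard timer kicks in
counts_in_maybe = missing_primary_descriptors([good, bad])
ok_while_maybe = can_build_circuits([good, bad])
```

In this model the bad bridge is harmless while it is marked NO, but as soon as it flips to MAYBE it counts as 1 missing out of 2 considered, which matches the "1/2 of our primary entry guards" wording in the log.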
In this case, everything was going smoothly: I had marked my non-working bridge as GUARD_REACHABLE_NO because I tried to fetch its descriptor and failed to connect to it. But then the guard subsystem kicks in:
Oct 23 06:55:34.555 [info] entry_guard_consider_retry(): Marked primary confirmed guard $9F090DE98CA6D67DEEB1F87EFE7C1BFD884E6E2E ($9F090DE98CA6D67DEEB1F87EFE7C1BFD884E6E2E) for possible retry, since we haven't tried to use it since 2021-10-23 06:40:50.
At that point I start to get bizarre failures for my directory fetches:
Oct 23 06:55:34.555 [notice] Ignoring directory request, since no bridge nodes are available yet.
and then I hit the dreaded "We're missing descriptors for 1/2 of our primary entry guards" line from #40396 (closed).
Now, in this particular case it worked out somewhat ok, because 5 minutes later fetch_bridge_descriptors() decided it was time to try getting a descriptor from my non-working bridge:
Oct 23 07:03:25.828 [debug] download_status_log_helper(): 69.163.35.159 attempted 27 time(s); I'll try again in 1361 seconds.
Oct 23 07:03:25.828 [debug] fetch_bridge_descriptors(): ask_bridge_directly=1 (0, 1, 0)
and then that bridge got marked back to GUARD_REACHABLE_NO and things recovered:
Oct 23 07:03:26.841 [info] We now have enough directory information to build circuits.
But my Tor spent those five minutes unwilling to do anything because it was missing a descriptor for one of its primary guards, and it wasn't even trying to fetch it.