Say you have a relay that is up and listed as non-slow in the consensus. Due to the current overload (#24902 (moved)), this relay is getting many many circuit requests per second. Due to bug #24767 (moved), we will make a huge number of connection attempts to other relays that are down, because as soon as we get a "connection refused", we will get another circuit request that triggers another connection attempt.
So when your relay restarts, since it's still in the consensus and clients still think it's usable, it will immediately get flooded with circuit requests, causing these connection attempts to resume.
And Tor calls every one of those connection attempts a bootstrapping attempt, even if there are no origin circuits related to that connection attempt.
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items 0
Show closed items
No child items are currently assigned. Use child items to break down this issue into smaller parts.
Linked items 0
Link issues together to show that they're related.
Learn more.
Jan 25 18:31:54.000 [warn] Problem bootstrapping. Stuck at 90%: Establishing a Tor circuit. (Connection timed out; TIMEOUT; count 1503; recommendation warn; host B7047FBDE9C53C39011CA84E5CB2A8E3543066D0 at 199.249.223.60:443)Jan 25 18:31:54.000 [warn] 1503 connections have failed:Jan 25 18:31:54.000 [warn] 1026 connections died in state connect()ing with SSL state (No SSL object)Jan 25 18:31:54.000 [warn] 392 connections died in state handshaking (TLS) with SSL state SSLv2/v3 read server hello A in HANDSHAKEJan 25 18:31:54.000 [warn] 84 connections died in state handshaking (Tor, v3 handshake) with SSL state SSL negotiation finished successfully in OPENJan 25 18:31:54.000 [warn] 1 connections died in state handshaking (TLS) with SSL state unknown state in HANDSHAKE
The hint is that this warn happens through control_event_bootstrap_prob_or() which is in four places in connection_or.c. That call happens when an OR connection fails, and none of those four callsites check whether it's an origin circuit, that is, whether it's "our" circuit or somebody else's.
You'll notice that somewhere along the line we wrapped those calls with
if (!authdir_mode_tests_reachability(options)
Commit d37fae2f is the commit from ancient history to skim, and commit 818332e7 is the more recent commit in this area that should give you some more context.
The fix might start with:
In connection_or.c, when we're considering whether to call control_event_bootstrap_prob_or, break that "if authdir_mode_tests_reachability" check out its into own function, called something like "could this connection be a bootstrap attempt", which checks if it's a reachability test, but also does more checks.
The trouble here is that in these "more checks", we want to check if there are any origin circuits pending on that connection attempt. And it looks like we don't actually know that here.
So the first option I thought of is that in that "could it be" function, we want to call circuit_get_all_pending_on_channel(), and then iterate over the resulting list to see if any of them are CIRCUIT_IS_ORIGIN(). That's certainly the most straightforward solution, in that it doesn't go mess with other code much. But it's kind of crummy, because we're walking another list (twice, in fact) every time there's a connection failure. It's not so bad though, because somewhere along the line somebody went to the trouble of making a separate list just for the circuits that are pending a channel attempt -- lucky us.
My second thought is: wait, we already do walk that list, later on, when we're trying to handle all of the circuits that were waiting for this channel to succeed or fail. That happens in circuit_n_chan_done(). So in the
if (!status) { /* chan failed; close circ */
clause in that loop, we could check if it's an origin circuit, and if so set a flag that makes us do a bootstrap complaint if the flag was ever set to 1.
But! circuit_n_chan_done() gets called from connection_or_about_to_close(), which is way after we know (and have then forgotten) why the conn closed. So that's still not best.
Could we add a field to or_connection_t remembering why the connection failed, and then use it during circuit_n_chan_done() to report bootstrap issues accurately? Maybe -- let's call that approach one.
A third option is: how about we set a flag in or_connection_t, which is "has an origin circuit ever been interested in my outcome", and that flag starts off as 0, and it gets set to 1 when we're about to add a circuit to circuits_pending_chans on a circuit where CIRCUIT_IS_ORIGIN() is true.
But! We only add to circuits_pending_chans inside circuit_set_state(), and that function doesn't know which chan or conn we're planning to attach this circ to.
The better callsite would be
circuit_handle_first_hop(), which is entirely for origin circuits. There are two cases we need to handle here. One is when there isn't a good conn available, and we decide to launch one for this circuit. That's the easy case: if channel_connect_for_circuit() returns NULL, and we launch a new one and it worked, we set the "we wanted this channel for an origin circuit" flag on it right then. The other is if channel_get_for_extend returned NULL but didn't set should_launch, meaning there is a conn somewhere out there, but it's not ready yet -- and we can't easily get ahold of it from this function.
For that second case, check out channel_get_for_extend(). There are two places that call it: circuit_handle_first_hop(), which is for origin circuits, and circuit_extend(), which is for handling EXTEND cells so it is explicitly not for origin circuits. If we added a flag to that function, to say whether it's an origin circuit we're asking on behalf of, then it can mark conns that had origin circuits asking about them. The only case I can see in that function where we would need to mark the chan is when we're incrementing n_inprogress_goodaddr.
So, let's call that "add a flag to the channel and turn it on when an origin circuit inquired" idea approach two.
I'm still looking into how best to do this. I think I'm starting to wrap my brain around it. What you call approach one and approach two differ in several ways, one of which is I suspect approach two would tend to minimize textual changes to existing code (which might be a reasonable goal for a 0.3.3 patch). A downside of approach two is pushing more higher-level state and logic into lower-level code, so I think it's not the best solution in the long term.
The status quo is that lower-level code has logic to use higher-level state to decide whether to call control_event_bootstrap_prob_or(). I think we should restructure things so that higher-level code has the necessary information from lower levels to make this decision, which weighs in favor of approach one. Maybe that's best done as a separate ticket, but I would like to try to make the existing abstractions cleaner instead of messier.
You'll notice that somewhere along the line we wrapped those calls with
{{{
if (!authdir_mode_tests_reachability(options)
}}}
Commit d37fae2f is the commit from ancient history to skim, and commit 818332e7 is the more recent commit in this area that should give you some more context.
It looks like commit c6a94718cd was the one that introduced the silencing of connection problems for dirauths.
Also some of the calls to authdir_mode_tests_reachability() are indirect through connection_or_connect_failed().
There are four places that call control_event_bootstrap_prob_or():
connection_or_about_to_close() in connection_or.c
This seems to have a fair amount of code in the TLS handshake
failure case, which could probably be factored out. It also
calls channel_closed() before any of the other stuff, which
maybe should be reordered?
It's only really called through connection_unlink() in main.c.
connection_or_connect_failed() in connection_or.c
This fairly simple function has multiple callers
connection_or_connect() in connection_or.c
This seems to only handle the case where a PT proxy is missing.
Notably, it does NOT call authdir_mode_tests_reachability().
connection_or_client_learned_peer_id() in connection_or.c
This seems to be when a router cert has an unexpected ID.
I'm thinking of ways to move the logic that currently calls control_event_bootstrap_prob_or() into either the channel layer or even higher. I think having the callers of control_event_bootstrap_prob_or() call authdir_mode_tests_reachability() themselves is unnecessary when control_event_bootstrap_prob_or() itself could probably safely do so. (The one case above where the caller doesn't call authdir_mode_tests_reachability() is a pluggable transport situation, and dirauths don't use pluggable transports that I know of.)
I believe it just produces scary-sounding warnings: the bootstrap complaint logs are entirely separate from Tor's decision to try to launch circuits and fetch things and etc.
I believe it just produces scary-sounding warnings: the bootstrap complaint logs are entirely separate from Tor's decision to try to launch circuits and fetch things and etc.
Thanks. I'll put it in sponsor8-can, then. I think it's also lower priority than bootstrap reporting issues that cause client-facing UX problems.
Fix title: this issue just produces spurious logs.
Trac: Summary: Relays consider it a bootstrapping failure if they can't extend for somebody else's circuit to Relays log a bootstrap warning if they can't extend for somebody else's circuit