Relays log a bootstrap warning if they can't extend for somebody else's circuit

changed milestone to %Tor: unspecified

added 033-included-20180320 033-triage-20180320 033-unreached-backport 034-deferred-20180602 035-removed-20180711 backport-032 bootstrap component::core tor/tor ex-catalyst-20190619 ex-sponsor-19 ex-sponsor19 milestone::Tor: unspecified priority::medium s8-errors severity::normal stability status::new type::defect labels

cc'ing catalyst since they've been working on bootstrapping stuff and might be the perfect person for doing up (and/or reviewing) a patch

Trac:
Cc: N/A to catalyst

I noticed this bug on my freeBogatov relay, and then a relay operator reported it on tor-relays too: https://lists.torproject.org/pipermail/tor-relays/2018-January/014316.html so I figured that was the right time to file a ticket.

Jan 25 18:31:54.000 [warn] Problem bootstrapping. Stuck at 90%: Establishing a Tor circuit. (Connection timed out; TIMEOUT; count 1503; recommendation warn; host B7047FBDE9C53C39011CA84E5CB2A8E3543066D0 at 199.249.223.60:443)
Jan 25 18:31:54.000 [warn] 1503 connections have failed:
Jan 25 18:31:54.000 [warn]  1026 connections died in state connect()ing with SSL state (No SSL object)
Jan 25 18:31:54.000 [warn]  392 connections died in state handshaking (TLS) with SSL state SSLv2/v3 read server hello A in HANDSHAKE
Jan 25 18:31:54.000 [warn]  84 connections died in state handshaking (Tor, v3 handshake) with SSL state SSL negotiation finished successfully in OPEN
Jan 25 18:31:54.000 [warn]  1 connections died in state handshaking (TLS) with SSL state unknown state in HANDSHAKE

I'm "must-ing" this ticket because I think we should avoid printing those and should properly report the bootstrap failure.

Trac:
Keywords: N/A deleted, 033-must added

Trac:
Keywords: N/A deleted, stability added

Marking all tickets reached by current round of 033 triage.

Trac:
Keywords: N/A deleted, 033-triage-20180320 added

Mark 033-must tickets as triaged-in for 0.3.3

Trac:
Keywords: N/A deleted, 033-included-20180320 added

Trac:
Status: new to assigned
Owner: N/A to catalyst

The hint is that this warn happens through control_event_bootstrap_prob_or() which is in four places in connection_or.c. That call happens when an OR connection fails, and none of those four callsites check whether it's an origin circuit, that is, whether it's "our" circuit or somebody else's.

You'll notice that somewhere along the line we wrapped those calls with

   if (!authdir_mode_tests_reachability(options)

Commit d37fae2f is the commit from ancient history to skim, and commit 818332e7 is the more recent commit in this area that should give you some more context.

The fix might start with:

In connection_or.c, when we're considering whether to call control_event_bootstrap_prob_or, break that "if authdir_mode_tests_reachability" check out its into own function, called something like "could this connection be a bootstrap attempt", which checks if it's a reachability test, but also does more checks.

The trouble here is that in these "more checks", we want to check if there are any origin circuits pending on that connection attempt. And it looks like we don't actually know that here.

So the first option I thought of is that in that "could it be" function, we want to call circuit_get_all_pending_on_channel(), and then iterate over the resulting list to see if any of them are CIRCUIT_IS_ORIGIN(). That's certainly the most straightforward solution, in that it doesn't go mess with other code much. But it's kind of crummy, because we're walking another list (twice, in fact) every time there's a connection failure. It's not so bad though, because somewhere along the line somebody went to the trouble of making a separate list just for the circuits that are pending a channel attempt -- lucky us.

My second thought is: wait, we already do walk that list, later on, when we're trying to handle all of the circuits that were waiting for this channel to succeed or fail. That happens in circuit_n_chan_done(). So in the

      if (!status) { /* chan failed; close circ */

clause in that loop, we could check if it's an origin circuit, and if so set a flag that makes us do a bootstrap complaint if the flag was ever set to 1.

But! circuit_n_chan_done() gets called from connection_or_about_to_close(), which is way after we know (and have then forgotten) why the conn closed. So that's still not best.

Could we add a field to or_connection_t remembering why the connection failed, and then use it during circuit_n_chan_done() to report bootstrap issues accurately? Maybe -- let's call that approach one.

A third option is: how about we set a flag in or_connection_t, which is "has an origin circuit ever been interested in my outcome", and that flag starts off as 0, and it gets set to 1 when we're about to add a circuit to circuits_pending_chans on a circuit where CIRCUIT_IS_ORIGIN() is true.

But! We only add to circuits_pending_chans inside circuit_set_state(), and that function doesn't know which chan or conn we're planning to attach this circ to.

The better callsite would be circuit_handle_first_hop(), which is entirely for origin circuits. There are two cases we need to handle here. One is when there isn't a good conn available, and we decide to launch one for this circuit. That's the easy case: if channel_connect_for_circuit() returns NULL, and we launch a new one and it worked, we set the "we wanted this channel for an origin circuit" flag on it right then. The other is if channel_get_for_extend returned NULL but didn't set should_launch, meaning there is a conn somewhere out there, but it's not ready yet -- and we can't easily get ahold of it from this function.

For that second case, check out channel_get_for_extend(). There are two places that call it: circuit_handle_first_hop(), which is for origin circuits, and circuit_extend(), which is for handling EXTEND cells so it is explicitly not for origin circuits. If we added a flag to that function, to say whether it's an origin circuit we're asking on behalf of, then it can mark conns that had origin circuits asking about them. The only case I can see in that function where we would need to mark the chan is when we're incrementing n_inprogress_goodaddr.

So, let's call that "add a flag to the channel and turn it on when an origin circuit inquired" idea approach two.

I pushed a bug25061-demo commit to my arma repo, which sketches out how approach two would look.

I'm still looking into how best to do this. I think I'm starting to wrap my brain around it. What you call approach one and approach two differ in several ways, one of which is I suspect approach two would tend to minimize textual changes to existing code (which might be a reasonable goal for a 0.3.3 patch). A downside of approach two is pushing more higher-level state and logic into lower-level code, so I think it's not the best solution in the long term.

The status quo is that lower-level code has logic to use higher-level state to decide whether to call control_event_bootstrap_prob_or(). I think we should restructure things so that higher-level code has the necessary information from lower levels to make this decision, which weighs in favor of approach one. Maybe that's best done as a separate ticket, but I would like to try to make the existing abstractions cleaner instead of messier.

Catalyst: given your summary above, I agree that approach two is better for an 0.3.3 patch, especially if we're thinking of backporting it.

Replying to arma:

You'll notice that somewhere along the line we wrapped those calls with {{{ if (!authdir_mode_tests_reachability(options) }}}

Commit d37fae2f is the commit from ancient history to skim, and commit 818332e7 is the more recent commit in this area that should give you some more context. It looks like commit c6a94718cd was the one that introduced the silencing of connection problems for dirauths.

Also some of the calls to authdir_mode_tests_reachability() are indirect through connection_or_connect_failed().

Trac:
Sponsor: N/A to Sponsor8
Keywords: N/A deleted, bootstrap, s8-errors added

Bump to 0.3.4 per nickm.

Trac:
Keywords: 033-must deleted, 033-backport added
Milestone: Tor: 0.3.3.x-final to Tor: 0.3.4.x-final

There are four places that call control_event_bootstrap_prob_or():

connection_or_about_to_close() in connection_or.c
- This seems to have a fair amount of code in the TLS handshake failure case, which could probably be factored out. It also calls channel_closed() before any of the other stuff, which maybe should be reordered?
- It's only really called through connection_unlink() in main.c.
connection_or_connect_failed() in connection_or.c
- This fairly simple function has multiple callers
connection_or_connect() in connection_or.c
- This seems to only handle the case where a PT proxy is missing. Notably, it does NOT call authdir_mode_tests_reachability().
connection_or_client_learned_peer_id() in connection_or.c
- This seems to be when a router cert has an unexpected ID.

I'm thinking of ways to move the logic that currently calls control_event_bootstrap_prob_or() into either the channel layer or even higher. I think having the callers of control_event_bootstrap_prob_or() call authdir_mode_tests_reachability() themselves is unnecessary when control_event_bootstrap_prob_or() itself could probably safely do so. (The one case above where the caller doesn't call authdir_mode_tests_reachability() is a pluggable transport situation, and dirauths don't use pluggable transports that I know of.)

Deferring non-must tickets to 0.3.5

Trac:
Milestone: Tor: 0.3.4.x-final to Tor: 0.3.5.x-final
Keywords: N/A deleted, 034-deferred-20180602 added

These tickets are being triaged out of 0.3.5. The ones marked "035-roadmap-proposed" may return.

Trac:
Keywords: N/A deleted, 035-removed-20180711 added
Milestone: Tor: 0.3.5.x-final to Tor: unspecified

We should double check whether this merely makes scary-sounding warnings, or if it actually prevents a relay from completing bootstrapping.

I believe it just produces scary-sounding warnings: the bootstrap complaint logs are entirely separate from Tor's decision to try to launch circuits and fetch things and etc.

Replying to arma:

I believe it just produces scary-sounding warnings: the bootstrap complaint logs are entirely separate from Tor's decision to try to launch circuits and fetch things and etc. Thanks. I'll put it in sponsor8-can, then. I think it's also lower priority than bootstrap reporting issues that cause client-facing UX problems.

Trac:
Sponsor: Sponsor8 to Sponsor8-can

Trac:
Sponsor: Sponsor8-can to Sponsor19-can

These open, non-merge_ready tickets can not get backported to 0.3.3, because 0.3.3 is now unsupported.

Trac:
Keywords: 033-backport deleted, N/A added

Hmm, I guess they should still get 033-backport-unreached

Trac:
Keywords: N/A deleted, 033-backport-unreached added

Adding the keyword to mark everything that didn't fit into the time for sponsor 19.

Trac:
Keywords: N/A deleted, ex-sponsor-19 added

Remove sponsor 19 and add a keyword ex-sponsor19 to mark all the tickets that could have been in the scope of the sponsor.

Trac:
Keywords: N/A deleted, ex-sponsor19 added
Sponsor: Sponsor19-can to N/A

Remove myself from tickets I'm not actively working on.

Trac:
Keywords: N/A deleted, ex-catalyst-20190619 added
Owner: catalyst to N/A

Trac:
Status: assigned to new

Fix 033-unreached-backport spelling.

Trac:
Keywords: 033-backport-unreached deleted, 033-unreached-backport added

Fix title: this issue just produces spurious logs.

Trac:
Summary: Relays consider it a bootstrapping failure if they can't extend for somebody else's circuit to Relays log a bootstrap warning if they can't extend for somebody else's circuit

mentioned in issue #28281 (moved)

moved to tpo/core/tor#25061 (closed)

Relays log a bootstrap warning if they can't extend for somebody else's circuit

Child items 0

Activity