Reachability Tests aren't conducted if there are no exit nodes

changed milestone to %Tor: 0.2.6.x-final

added 026-deferrable chutney component::core tor/tor lorax milestone::Tor: 0.2.6.x-final owner::teor parent::14034 priority::medium resolution::fixed status::closed test-network tor-relay type::defect version::tor 0.2.6.1-alpha labels

Marking as should-fix-eventually (0.2.???); I'd take a clean patch to fix this if somebody writes one.

Trac:
Milestone: N/A to Tor: 0.2.???
Keywords: N/A deleted, tor-relay test-network lorax added

Around 1-10% of chutney runs show an error like this on my OS X (multiprocessor) system, and chutney's transmission test fails. This condition can persist for well over 30 minutes, when a normal run of this chutney net succeeds within 18 seconds.

This appears to be exacerbated by:

the existing shorter intervals set for TestingTorNetwork
the custom shorter intervals set in the chutney templates
larger numbers of tor processes
in particular, larger numbers of authorities in the test network
background system load (e.g. compilation)

In this particular case, it looks like a race condition around chutney-launched tor processes causes this issue.

If it would help, I can provide logs, or try to produce a chutney net configuration with a larger failure rate.

#13787 (moved) may be a duplicate of this. nickm to confirm.

Trac:
Cc: N/A to teor, nickm

I believe that an appropriate fix for this issue is to extend router_have_minimum_dir_info to take a parameter dir_info_purpose indicating what the dir info would be used for. (Or, perhaps, a set of flags for guard/middle/exit. Have to look into this.)

This short-circuits the chicken-and-egg issue by splitting the checks into internal and external. The internal check can succeed first, which activates the conditions (exit ports) that allow the external check to succeed.

(Alternately, we can force exits using the new TestingDirAuthVoteExit in tor 0.2.6, which works sometimes, but not always.)

Trac:
Owner: N/A to teor
Summary: Reachability Tests aren't conducted if there are no exit nodes to `
Status: new to assigned

Oops, stray keypress in the title field.

Trac:
Summary: ` to Reachability Tests aren't conducted if there are no exit nodes

After attempting to test my proposed changes, I believe there are multiple race conditions in the network bootstrap that cause intermittent failures.

However, the chicken-and-egg exit issue covered by this bug produces reproducible failures (I believe it to be the cause of #13161 (moved) and one of the potential causes for #13787 (moved)).

In order to simplify testing, I have created a chutney config that (AFAIK) contains the smallest possible/reasonable Tor network: 3 authorities, 1 exit, 1 client.

Branch: basic-min Repository: https://github.com/teor2345/chutney.git

Nick, would you like to merge the chutney branch?

I will be testing my changes against this minimal config in order to eliminate intermittent failures from the more complex, rarer race conditions.

Success Criteria: The old (95%) and new code (99%) both succeed as long as TestingDirAuthVoteExit is turned on.

The old code fails (0%) when TestingDirAuthVoteExit is turned off. (See #13161 (moved).) The new code should reliably (95%) bootstrap with TestingDirAuthVoteExit turned off.

I'll get back to you after a few hundred test runs...

Trac:
Keywords: tor-relay test-network lorax deleted, tor-relay test-network lorax chutney added

I've posted the draft tor changes to:

Branch: bug13718-stop-req-exits-for-or-conns Repository: https://github.com/teor2345/tor.git

The branch contains two commits:

ignore exits when checking min dir info for internal connections (includes detailed log messages). This is the maximally compatible change that could be back-ported. Reported BOOTSTRAP_STATUS values try to look as much like the old version as possible. (Some duplicate events may be generated.)
split BOOTSTRAP_STATUS into INTERNAL and EXIT stages. This changes the values and number of events the controller will receive. This helps in determining whether we're hanging waiting for internal or exit paths. But it isn't necessary to back-port it.

I'll attach my continuous testing script, which could go in chutney or tor, if it would be useful. (Which one, Nick?)

I'm currently testing the failure rate of this code on OS X (i386 & x86_64), can others test on Linux & Windows?

This also probably needs some simple unit tests. Not quite sure how to write those.

Trac:
Version: N/A to Tor: 0.2.6.1-alpha

Trac:
continuous-test-network.sh

Continuously run chutney until data transmission fails. Good for intermittent errors.

Merged the chutney patch; will review the other one on the bus.

Thanks! Here are some initial thoughts:

42e4c18236068984c027ec1d737b34595ada8ace:

I kinda want an enum for the argument to router_have_minimum_dir_info(), rather than a boolean. It seems like it would be clearer that way. Or possibly, there should be two wrappers around it: have_minimum_dir_info_for_exit_circ(), have_minimum_dir_info_for_internal_circ().
The documentation for status_out in compute_frac_paths_available needs to be explicit about allocating a new string (by convention).

440b10ec29d19459376d380bdd659fc8c9d5bb26

Need to tweak messages in bootstrap_status_to_string to make them a little more human-comprehensible, or users will wonder what they mean.
Do we need corresponding control-spec changes to document these statuses?
This needs a changes/ file too.

Happy to make these changes, Nick.

I've now seen the statuses pop up when launching TorBrowser using this build, so I understand the need for comprehensibility.

I kinda want an enum for the argument to router_have_minimum_dir_info(), rather than a boolean. It seems like it would be clearer that way. Or possibly, there should be two wrappers around it: have_minimum_dir_info_for_exit_circ(), have_minimum_dir_info_for_internal_circ().

Is there the possibility of needing to calculate weights for guard, middle, and exit nodes in arbitrary combinations? (i.e. before choosing a guard node, ensure minimum guard bandwidth) If so, we could use a set of bit-shift flags.

If not, I'm happy to set up an enum with the two current values of Exit and Internal, and possibly an aliased value for those circumstances where we want a default option.
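A minimal sketch of the enum option, for discussion only. The names `dir_info_purpose_t`, `DIR_INFO_PURPOSE_*` and `dir_info_state_t` are hypothetical, and the struct is a stand-in for state that tor tracks internally; only the two wrapper names come from the review comment above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical purpose enum: what the directory info will be used for. */
typedef enum {
  DIR_INFO_PURPOSE_EXIT = 0,      /* paths that end at an exit */
  DIR_INFO_PURPOSE_INTERNAL = 1,  /* paths that stay inside the network */
  /* Aliased default, kept equal to EXIT to match the old behaviour. */
  DIR_INFO_PURPOSE_DEFAULT = DIR_INFO_PURPOSE_EXIT
} dir_info_purpose_t;

/* Stand-in for tor's internal bookkeeping (an assumption of this sketch). */
typedef struct {
  bool enough_for_internal_paths;
  bool enough_for_exit_paths;
} dir_info_state_t;

/* Answer "do we have enough dir info?" for one purpose. */
bool
have_minimum_dir_info_for(const dir_info_state_t *st,
                          dir_info_purpose_t purpose)
{
  assert(st != NULL);
  switch (purpose) {
    case DIR_INFO_PURPOSE_INTERNAL:
      return st->enough_for_internal_paths;
    case DIR_INFO_PURPOSE_EXIT:
      return st->enough_for_exit_paths;
  }
  return false; /* unknown purpose: be conservative */
}

/* The two wrappers suggested in the review. */
bool
have_minimum_dir_info_for_exit_circ(const dir_info_state_t *st)
{
  return have_minimum_dir_info_for(st, DIR_INFO_PURPOSE_EXIT);
}

bool
have_minimum_dir_info_for_internal_circ(const dir_info_state_t *st)
{
  return have_minimum_dir_info_for(st, DIR_INFO_PURPOSE_INTERNAL);
}
```

With this shape, existing callers that do not care about the distinction could pass the aliased default and keep their current exit-based behaviour.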

We may also need to update the status/enough-dir-info GETINFO control event - should we add status/enough-dir-info/exit and status/enough-dir-info/internal (defaulting status/enough-dir-info to exit for backwards compatibility)?

I also wonder about the impact of changing the invocation of circuit_build_needed_circs() so that it runs when we know we have internal circuits, rather than waiting for exit circuits.

Should we split it into internal and exit versions? If so, which types of circuits go in each category?

Replying to teor:

Happy to make these changes, Nick.

I've now seen the statuses pop up when launching TorBrowser using this build, so I understand the need for comprehensibility.

I kinda want an enum for the argument to router_have_minimum_dir_info(), rather than a boolean. It seems like it would be clearer that way. Or possibly, there should be two wrappers around it: have_minimum_dir_info_for_exit_circ(), have_minimum_dir_info_for_internal_circ().

Is there the possibility of needing to calculate weights for guard, middle, and exit nodes in arbitrary combinations? (i.e. before choosing a guard node, ensure minimum guard bandwidth) If so, we could use a set of bit-shift flags.

I don't think so. It would be likelier to have to calculate weights for different kinds of circuits, I imagine.

If not, I'm happy to set up an enum with the two current values of Exit and Internal, and possibly an aliased value for those circumstances where we want a default option.

Sounds good.

We may also need to update the status/enough-dir-info GETINFO control event - should we add status/enough-dir-info/exit and status/enough-dir-info/internal (defaulting status/enough-dir-info to exit for backwards compatibility)?

Sounds fine, though it could be a separate ticket.

I also wonder about the impact of changing the invocation of circuit_build_needed_circs() so that it runs when we know we have internal circuits, rather than waiting for exit circuits. Should we split it into internal and exit versions? If so, which types of circuits go in each category?

That's an interesting question, but it sounds like a separate ticket. Generally, anything that is a predicted circuit, or anything that might carry user traffic, is an exit circuit. Anything else is an internal circuit.

Split off #13813 (moved) for internal and exit sub-events to the status/enough-dir-info GETINFO control event.

Split off #13814 (moved) for building HS IP and other internal needed circuits earlier, once we can build internal paths.

Occasionally, the CPU load on my test machine will increase (or some other condition affecting the scheduler will occur), and a bootstrap race condition will cause the test to fail 50-100% of the time for a few hours. Then it will start working again.

The commands run are exactly the same each time. I'll be excluding these results from the tests, because they happen with or without the changes.

Perhaps lengthening some of the default intervals chutney uses would solve this?

Split this issue into #13823 (moved)

Replying to teor:

I believe that an appropriate fix for this issue is to extend router_have_minimum_dir_info to take a parameter dir_info_purpose indicating what the dir info would be used for. (Or, perhaps, a set of flags for guard/middle/exit. Have to look into this.)

Another simpler hack might be to say that you don't have to think about whether you know about enough exits if there aren't any exits in the consensus you have.

Testing this patch appears to have revealed another bug where chutney-run tor authorities don't flag anything as an Exit (fixed in #13161 (moved) with TestingDirAuthVoteExit, perhaps related to #11264 (moved)).

Perhaps we should test with:

TestingDirAuthVoteExit *
AssumeReachable 0

This will avoid the issues in #13161 (moved) / #11264 (moved), while still testing the reachability bootstrapping concerns of the OP.

I have logged this as a separate issue #13839 (moved), where the policy is accept * but no Exit flag is assigned.

Replying to arma:

Replying to teor:

I believe that an appropriate fix for this issue is to extend router_have_minimum_dir_info to take a parameter dir_info_purpose indicating what the dir info would be used for. (Or, perhaps, a set of flags for guard/middle/exit. Have to look into this.)

Another simpler hack might be to say that you don't have to think about whether you know about enough exits if there aren't any exits in the consensus you have.

In particular, I worry about the case where the guard knows what info the client has gotten, and sees the client start building circuits so it knows more about the purpose of these circuits.

So there's still value imo in waiting for circuit-building until we have all the network info that we need for a variety of actions. The bug here is that we have the wrong definition of "all the network info that we need" when the network has no exits. So we should be fixing that definition.

OK, so from reading the code, if the network has no exits:

The following circuits should still be built, and are internal:

Hidden Service Circuits (Server, Client, Introduction Point)
Socks Proxy Circuits (when connected to HSs)

The following circuits should be built, but aren't currently configured as internal:

Circuit Build Timeout Circuits

We could conditionally configure these as internal if there are no exits in the consensus.

The following circuits can never be built (and we shouldn't try, as it produces lots of errors):

Exit Circuits
Socks Proxy Circuits (when connected to Exits)

I'll change the current patch to:

if there are no exits in the consensus: build internal circuits as soon as we have enough info for internal circuits
if there are exits in the consensus: delay building internal circuits and build exit circuits when we have enough info for exit circuits

This involves a simple change in the modified router_have_minimum_dir_info to check for exits in the consensus.

This will, however, cause issues if we have a consensus that has exits, but we can't get (enough of) their descriptors. But this is no worse than the current behaviour.
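The rule described above can be sketched as a small decision function. This is an illustration under stated assumptions, not the patch itself: `dir_info_summary_t` and `should_start_building_circuits()` are hypothetical names, and the three booleans stand in for checks that tor derives from the consensus and the descriptors it holds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of what we currently know about the network. */
typedef struct {
  bool consensus_has_exits;      /* does any router in the consensus exit? */
  bool enough_info_for_internal; /* enough descriptors for internal paths */
  bool enough_info_for_exit;     /* enough descriptors for exit paths */
} dir_info_summary_t;

/* Decide whether circuit building may start. */
bool
should_start_building_circuits(const dir_info_summary_t *s)
{
  assert(s != NULL);
  if (!s->consensus_has_exits) {
    /* No exits exist: waiting for exit info would wait forever. */
    return s->enough_info_for_internal;
  }
  /* Exits exist: delay everything until exit paths are possible. */
  return s->enough_info_for_exit;
}
```

Note that a consensus listing exits whose descriptors cannot be fetched still blocks here, which matches the caveat in the paragraph above.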

To test this patch, I need to generate at least two consensuses:

one with no exits, which the exits can use to determine their own reachability based on internal paths
one with exits appropriately flagged after self-testing.

But MIN_VOTE_INTERVAL is 5 minutes, so I've defined MIN_VOTE_INTERVAL_TESTING 10. I've patched tor to use MIN_VOTE_INTERVAL_TESTING during testing. See #13823 (moved) for details.

Replying to teor:

This will, however, cause issues if we have a consensus that has exits, but we can't get (enough of) their descriptors. But this is no worse than the current behaviour.

I think I would call this part a feature: it reduces the impact of attacks where your guard trickles information to you and then watches to see when you start acting.

I think there's got to be some version of this we can do in 0.2.6

Trac:
Keywords: tor-relay test-network lorax chutney deleted, tor-relay test-network lorax chutney 026-deferrable added
Milestone: Tor: 0.2.??? to Tor: 0.2.6.x-final

I'm not able to do a code review right now, but I did run your branch on commit 440b10ec29d19459376d380bdd659fc8c9d5bb26 and stuck a relay into my no-exit network. The notices log is:

Dec 08 19:25:43.000 [notice] Bootstrapped 0%: Starting
Dec 08 19:25:43.000 [notice] This version of Tor (0.2.6.1-alpha-dev) is newer than any recommended version, according to the directory authorities. Recommended versions are: 0.2.4.23,0.2.4.24,0.2.5.6-alpha,0.2.5.7-rc,0.2.5.8-rc
Dec 08 19:25:43.000 [notice] Bootstrapped 35%: Asking for relay descriptors
Dec 08 19:25:43.000 [notice] We now have enough directory information to build internal circuits.
Dec 08 19:25:43.000 [notice] Bootstrapped 50%: Connecting to the Tor network (internal)
Dec 08 19:25:43.000 [notice] Bootstrapped 55%: Finishing handshake with first hop (internal)
Dec 08 19:25:43.000 [notice] We weren't able to find support for all of the TLS ciphersuites that we wanted to advertise. This won't hurt security, but it might make your Tor (if run as a client) more easy for censors to block.
Dec 08 19:25:43.000 [notice] To correct this, use a version of OpenSSL built with none of its ciphers disabled.
Dec 08 19:25:43.000 [notice] Bootstrapped 70%: Loading relay descriptors (internal)
Dec 08 19:29:48.000 [notice] I learned some more directory information, but not enough to build an exit circuit: We need more descriptors: we have 6/6, and can only build 0% of likely exit paths. (We have 100% of guards bw, 100% of midpoint bw, and 0% of exit bw = 0% of exit path bw.)

Previously I never got past the 50% mark, so that appears to be a good sign.
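The last log line reports the exit path fraction as the product of the guard, midpoint and exit bandwidth fractions (100% x 100% x 0% = 0%). A rough sketch of that arithmetic follows. The function name is hypothetical, and dropping the exit term when the consensus has no exits is the adjustment discussed in this ticket, shown here as an assumption rather than as the branch's actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Fraction of usable paths, given the fraction of guard, middle and
 * exit bandwidth we hold descriptors for (each in the range 0.0-1.0). */
double
frac_paths_available_sketch(double f_guard, double f_mid, double f_exit,
                            bool consensus_has_exits)
{
  assert(f_guard >= 0.0 && f_guard <= 1.0);
  assert(f_mid >= 0.0 && f_mid <= 1.0);
  assert(f_exit >= 0.0 && f_exit <= 1.0);

  if (!consensus_has_exits) {
    /* Internal paths only: the exit term is left out (assumption). */
    return f_guard * f_mid;
  }
  return f_guard * f_mid * f_exit;
}
```

For the values in the log, the three-term product is 0.0, while the internal-only product is 1.0, which is consistent with the relay getting past the 50% mark on internal paths.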

Thanks for these results, tom.

I am still working on an updated version of this branch, and can get it to 100% bootstrap.

But data transmission fails due to the following:

The authorities upload their descriptors, but appear to fail reachability self-testing (and therefore don't allow anything to exit? I need to confirm this).
The exit relay(s) appear to fail reachability self-testing (and therefore descriptor upload, so they don't appear in the consensus).
Even if reachability was working, I would also need the consensus to run again to pick up the changed descriptors - see #13823 (moved).
The client(s) complain they don't have any usable exits, so they fail to transmit data.

I will look into this within the next few days, and tweak a few things, then post an updated version of the code for comments (and, if I've made no progress, a plea for help).

I have made changes in #13823 (moved) that successfully have authorities create a consensus every 10 seconds in a chutney network. This should work as long as the clocks are strictly synchronised.

For comparison, the src/test/test-network.sh script allows 18 seconds for chutney to launch and do its tests, which is two of these consensuses.

The way tor determines reachability is broken for test, internal, and local networks.

When we're on a local address and DirAllowPrivateAddresses is 1, tor should check whether we're connecting to our own digest, or another router's.

Split off as #13924 (moved).

From #13823 (moved): A relay doesn't re-publish its descriptor until up to 60 seconds have elapsed. In a testing tor network, it should upload immediately when the ORPort or DirPort changes.

In a TestingTorNetwork, when TestingAuthDirTimeToLearnReachability is much lower than its normal value of 30 minutes, bootstrap will happen much more reliably if we test reachability at a proportionally faster rate.

Split off as #13929 (moved).

A relay with AssumeReachable 0 now makes it into the consensus after around 30-40 seconds. The minimum time should be (each consensus is 10 seconds apart with a 4s lead-up):

1 consensus to determine available relays, excluding this one (4s, initial consensus)
some time to download descriptors, and make an initial connection
some time to make a connection to ourselves for ORPort self-testing
some time to rebuild and upload descriptor
1 consensus to determine available relays, including this one (4s-14s)

So if we can spend up to 18 seconds simply building consensuses, 30-40 seconds is a fair amount of elapsed time to complete this whole process.

This probably means we should increase the default src/test/test-network.sh bootstrap time to 40 seconds.
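The consensus part of that estimate can be written out as arithmetic. This is a sketch only: the struct and function names are hypothetical, and it assumes the "4s-14s" range above means the lead-up alone in the best case and the lead-up plus one full interval in the worst case.

```c
#include <assert.h>

/* Bounds, in seconds, on time spent purely waiting for consensuses. */
typedef struct {
  int min_s;
  int max_s;
} wait_bounds_t;

wait_bounds_t
consensus_wait_bounds(int interval_s, int lead_up_s)
{
  wait_bounds_t b;
  assert(interval_s > 0 && lead_up_s >= 0);
  /* The initial consensus costs one lead-up. The second costs one
   * lead-up at best, or a lead-up plus a whole interval if the relay's
   * descriptor arrives just after a voting round has started. */
  b.min_s = lead_up_s + lead_up_s;
  b.max_s = lead_up_s + (lead_up_s + interval_s);
  return b;
}
```

With a 10 second interval and a 4 second lead-up the upper bound is 18 seconds, matching the figure quoted above; the remainder of the 30-40 seconds goes to descriptor download, self-testing and descriptor upload.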

A relay with AssumeReachable 0 now makes it into the consensus after around 30-40 seconds, even without using TestingDirAuthVoteExit (from #13161 (moved)). This means that it correctly:

determines that no exits are available in the consensus
continues to bootstrap with internal paths only
successfully self-tests reachability with an internal path

I'll attach authority and relay torrcs that exhibit this behaviour with my new code.

I still haven't solved #13839 (moved) - the authorities don't flag anything as an exit (at least until around 30 minutes later?). But I think I'll leave that for another time.

Now I'll review my changes and group them into commits with appropriate changes/* files. This could take me a week or so.

Trac:
torrc

authority torrc that enables relays to bootstrap in a network without exits

Trac:
torrc.2

relay torrc that bootstraps in a network without exits

Trac:
Cc: teor, nickm to teor, nickm, dgoulet

#13839 (moved) can be resolved using TestingMinExitFlagThreshold 0, which causes authorities to ignore advertised / measured bandwidth when assigning the Exit flag.

A minor fix is required to router_is_active() to ignore measured bandwidth when assigning the Active flag (which is required for the Exit flag).

After this change:

An exit node with AssumeReachable 0 now makes it into the consensus after around 30-40 seconds. Clients then use this updated consensus from 40-60 seconds (the old consensus must expire first). The network uses TestingMinExitFlagThreshold 0 on the authorities, rather than TestingDirAuthVoteExit (from #13161 (moved)).

This means that the exit node correctly: (20s-30s)

determines that no Exits are available in the consensus
continues to bootstrap with internal paths only
successfully self-tests reachability with an internal path
posts its descriptor to the authorities

And the authorities correctly: (30s-40s)

assign the Exit flag
distribute an updated consensus containing the Exit node

And the client correctly: (40s-60s)

requests the newest consensus
- see #13963 (moved) for a fix that reduces this request time from 3 minutes to half the consensus interval, in networks with a low consensus interval.
determines that Exits are now available in the new consensus
starts building external paths

I'll attach authority, relay and client torrcs that exhibit this behaviour with the patches listed above.

Now I'll review my changes and group them into commits with appropriate changes/* files. This could take me a few days.

I now have the bootstrap process down to 30 seconds with changes to the TestingServerDownloadSchedule, TestingClientDownloadSchedule, and MIN_INITIAL_VOTE_INTERVAL_TESTING. I have defined MIN_INITIAL_VOTE_INTERVAL_TESTING as half of MIN_VOTE_INTERVAL_TESTING, as there is no previous consensus to clash with.
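As a sketch of how the interval constants relate, assuming the values mentioned in this ticket (a normal minimum of 5 minutes and a testing minimum of 10 seconds). The selector function is hypothetical and only illustrates which constant applies when; it is not the code in the branch.

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_VOTE_INTERVAL (5 * 60)   /* normal networks: 5 minutes */
#define MIN_VOTE_INTERVAL_TESTING 10 /* testing networks: 10 seconds */
/* Half the testing minimum: no previous consensus exists to clash with. */
#define MIN_INITIAL_VOTE_INTERVAL_TESTING (MIN_VOTE_INTERVAL_TESTING / 2)

/* Return the smallest voting interval, in seconds, that is accepted. */
int
min_vote_interval_sketch(bool testing_network, bool is_initial_interval)
{
  /* Integer halving must never round the initial interval down to 0. */
  assert(MIN_INITIAL_VOTE_INTERVAL_TESTING > 0);

  if (!testing_network) {
    return MIN_VOTE_INTERVAL;
  }
  return is_initial_interval ? MIN_INITIAL_VOTE_INTERVAL_TESTING
                             : MIN_VOTE_INTERVAL_TESTING;
}
```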

This should make dgoulet happy.

Updated attachments to follow.

Trac:
torrc.authority

authority torrc that bootstraps network with no exits to full functionality in 30 seconds

Trac:
torrc.exit

exit torrc that bootstraps network with no exits to full functionality in 30 seconds

Trac:
torrc.client

client torrc that bootstraps network with no exits to full functionality in 30 seconds

See also #13976 (moved), which would vastly simplify the configuration required to get rapid tor/chutney bootstraps to work.

Attaching example fast torrc files that will allow a chutney network to bootstrap in 8-10 seconds, as we can skip reachability testing, and give the Exit flag to everything. Of course, these will be available in the corresponding chutney patch. (Which I am working on right now.)

Trac:
torrc.authority.fast

too long; didn't read

target function: consider_testing_reachability

call site #1: directory_info_has_arrived

call site #2: run_scheduled_events (and call site #3)

call site #4: circuit_testing_opened
