Reachability Tests aren't conducted if there are no exit nodes
Context:

* https://lists.torproject.org/pipermail/tor-dev/2014-October/007613.html
* https://lists.torproject.org/pipermail/tor-dev/2014-October/007654.html

On 22 October 2014 05:48, Roger Dingledine <arma@mit.edu> wrote:
>> What I had to do was make one of my Directory Authorities an exit -
>> this let the other nodes start building circuits through the
>> authorities and upload descriptors.
>
> This part seems surprising to me -- directory authorities always publish
> their dirport whether they've found it reachable or not, and relays
> publish their descriptors directly to the dirport of each directory
> authority (not through the Tor network).
>
> So maybe there's a bug that you aren't describing, or maybe you are
> misunderstanding what you saw?
>
> See also https://trac.torproject.org/projects/tor/ticket/11973
>
>> Another problem I ran into was that nodes couldn't conduct
>> reachability tests when I had exits that were only using the Reduced
>> Exit Policy - because it doesn't list the ORPort/DirPort! (I was
>> using nonstandard ports actually, but indeed the reduced exit policy
>> does not include 9001 or 9030.) Looking at the current consensus,
>> there are 40 exits that exit to all ports, and 400-something exits
>> that use the ReducedExitPolicy. It seems like 9001 and 9030 should
>> probably be added to that for reachability tests?
>
> The reachability tests for the ORPort involve extending the circuit to
> the ORPort -- which doesn't use an exit stream. So your relays should
> have been able to find themselves reachable, and published a descriptor,
> even with no exit relays in the network.

Okay, so the behavior I saw, and reproduced, is that reachability tests didn't succeed (and therefore descriptors weren't uploaded) when there were no exits. I think I may have figured out why, but there are some internals I haven't completely figured out. I'm going to lay out what I think and then the parts I'm not completely sure about.
First off, you're (obviously) correct that I misunderstood: extending the circuit doesn't use an Exit stream, so an Exit isn't needed for that. But still, I think the lack of Exits stopped the reachability tests from succeeding.

## too long; didn't read

I don't think reachability tests happen when there are no Exit nodes because of a quirk in the bootstrapping process, where we never think we have a minimum of directory information.

## target function: consider_testing_reachability

A reachability test is conducted from `consider_testing_reachability` (I think it's only conducted from here? Although maybe there are other situations where it could happen..?)

`consider_testing_reachability` is called from `circuit_send_next_onion_skin`, `circuit_testing_opened`, `run_scheduled_events`, and `directory_info_has_arrived`.

## call site #1: directory_info_has_arrived

This is called very frequently on router startup. But `consider_testing_reachability` will not be called if `router_have_minimum_dir_info` returns false:

```
void
directory_info_has_arrived(time_t now, int from_cache)
{
  //...
  if (!router_have_minimum_dir_info()) {
    //...
    return;
  } else {
    /* ... */
  }

  if (server_mode(options) && !net_is_disabled() && !from_cache &&
      (can_complete_circuit || !any_predicted_circuits(now)))
    consider_testing_reachability(1, 1);
}
```

`router_have_minimum_dir_info` returns the static variable `have_min_dir_info`. This variable is only set to 1 in `update_router_have_minimum_dir_info`, and then only if there are Exits! Specifically, we will trigger `paths < get_frac_paths_needed_for_circs(options,consensus)` because we have 0% of the Exit Bandwidth, as shown by this log message:

```
Nov 09 22:10:26.000 [notice] I learned some more directory information, but not enough to build a circuit: We need more descriptors: we have 5/5, and can only build 0% of likely paths. (We have 100% of guards bw, 100% of midpoint bw, and 0% of exit bw.)
```

```
update_router_have_minimum_dir_info(void)
{
  //...
    char *status = NULL;
    int num_present=0, num_usable=0;
    double paths = compute_frac_paths_available(consensus, options, now,
                                                &num_present, &num_usable,
                                                &status);

    if (paths < get_frac_paths_needed_for_circs(options,consensus)) {
      tor_snprintf(dir_info_status, sizeof(dir_info_status),
                   "We need more %sdescriptors: we have %d/%d, and "
                   "can only build %d%% of likely paths. (We have %s.)",
                   using_md?"micro":"", num_present, num_usable,
                   (int)(paths*100), status);
      //...
      res = 0;
      goto done;
    }

    res = 1;
  }

 done:
  if (res && !have_min_dir_info) {
    /* ... */
  }
  if (!res && have_min_dir_info) {
    int quiet = directory_too_idle_to_fetch_descriptors(options, now);
    tor_log(quiet ? LOG_INFO : LOG_NOTICE, LD_DIR,
        "Our directory information is no longer up-to-date "
        "enough to build circuits: %s", dir_info_status);

    /* a) make us log when we next complete a circuit, so we know when Tor
     * is back up and usable, and b) disable some activities that Tor
     * should only do while circuits are working, like reachability tests
     * and fetching bridge descriptors only over circuits. */
    can_complete_circuit = 0;

    control_event_client_status(LOG_NOTICE, "NOT_ENOUGH_DIR_INFO");
  }
  have_min_dir_info = res;
}
```

(The exact source line is in `frac_nodes_with_descriptors`, called by `compute_frac_paths_available`:)

```
/** For all nodes in <b>sl</b>, return the fraction of those nodes, weighted
 * by their weighted bandwidths with rule <b>rule</b>, for which we have
 * descriptors. */
double
frac_nodes_with_descriptors(const smartlist_t *sl,
                            bandwidth_weight_rule_t rule)
{
  //...
  if (smartlist_len(sl) == 0)
    return 0.0;
```

This prevents a reachability test from being triggered from `directory_info_has_arrived`.

## call site #2: run_scheduled_events (and call site #3)

There's a litany of conditions that must hold before `consider_testing_reachability` is called from `run_scheduled_events`. In particular, there's `can_complete_circuit`:

```
  if (time_to_check_descriptor < now && !options->DisableNetwork) {
    //...
    /* also, check religiously for reachability, if it's within the first
     * 20 minutes of our uptime. */
    if (is_server &&
        (can_complete_circuit || !any_predicted_circuits(now)) &&
        !we_are_hibernating()) {
      if (stats_n_seconds_working < TIMEOUT_UNTIL_UNREACHABILITY_COMPLAINT) {
        consider_testing_reachability(1, dirport_reachability_count==0);
```

`can_complete_circuit` is only set to 1 in `circuit_send_next_onion_skin`, and then only if a circuit is built and it is not `circ->build_state->onehop_tunnel`. I _think_ this means the circuit is a full circuit, complete with Exit. Right?

```
int
circuit_send_next_onion_skin(origin_circuit_t *circ)
{
  //...
  if (circ->cpath->state == CPATH_STATE_CLOSED) {
    // ...
  } else {
    //...
    hop = onion_next_hop_in_cpath(circ->cpath);
    if (!hop) {
      //...
      if (!can_complete_circuit && !circ->build_state->onehop_tunnel) {
        can_complete_circuit=1;
        /* FFFF Log a count of known routers here */
        log_notice(LD_GENERAL,
            "Tor has successfully opened a circuit. "
            "Looks like client functionality is working.");
        //...
        if (server_mode(options) && !check_whether_orport_reachable()) {
          inform_testing_reachability();
          consider_testing_reachability(1, 1);
```

This is also the third place `consider_testing_reachability` is called - there is only one left:

## call site #4: circuit_testing_opened

```
/** A testing circuit has completed. Take whatever stats we want.
 * Noticing reachability is taken care of in onionskin_answer(),
 * so there's no need to record anything here. But if we still want
 * to do the bandwidth test, and we now have enough testing circuits
 * open, do it.
 */
static void
circuit_testing_opened(origin_circuit_t *circ)
{
  if (have_performed_bandwidth_test ||
      !check_whether_orport_reachable()) {
    /* either we've already done everything we want with testing circuits,
     * or this testing circuit became open due to a fluke, e.g. we picked
     * a last hop where we already had the connection open due to an
     * outgoing local circuit. */
    circuit_mark_for_close(TO_CIRCUIT(circ), END_CIRC_AT_ORIGIN);
  } else if (circuit_enough_testing_circs()) {
    router_perform_bandwidth_test(NUM_PARALLEL_TESTING_CIRCS, time(NULL));
    have_performed_bandwidth_test = 1;
  } else
    consider_testing_reachability(1, 0);
}
```

But... as far as I can tell, a testing circuit is only used for two things: conducting a reachability test and conducting a bandwidth self-test. The only place a bandwidth self-test is called is inside `circuit_testing_opened`. So this call of `consider_testing_reachability` is a chicken-and-egg problem: it only runs once a testing circuit has opened, and (as far as I can tell) a testing circuit is only opened by a reachability test in the first place.
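
To make the argument above easier to check, here is a toy C model of the four gates. It is only a sketch of my reading of the code, not Tor's implementation: every `model_*` name is hypothetical, and it assumes (a) the usable path fraction is the product of the guard, middle, and exit bandwidth fractions, (b) a threshold of 0.6 for "enough paths", (c) a multi-hop circuit is only built once we have minimum directory info, and (d) a freshly started relay always has predicted circuits.

```c
#include <assert.h>
#include <stdio.h>

/* ASSUMPTION: stand-in for get_frac_paths_needed_for_circs(). */
#define MODEL_FRAC_PATHS_NEEDED 0.6

/* Hypothetical state; these fields only mirror the flags discussed above. */
typedef struct {
  double guard_bw_frac, mid_bw_frac, exit_bw_frac;
  int have_min_dir_info;
  int can_complete_circuit;
  int predicted_circuits;    /* ASSUMPTION: true on a fresh relay */
  int testing_circuit_open;  /* only ever set by a reachability test */
} model_state_t;

/* Mirrors update_router_have_minimum_dir_info(): with no exits the
 * product is 0, so we never reach "res = 1". */
static void
model_update_min_dir_info(model_state_t *s)
{
  double paths = s->guard_bw_frac * s->mid_bw_frac * s->exit_bw_frac;
  s->have_min_dir_info = (paths >= MODEL_FRAC_PATHS_NEEDED);
}

/* Call site #1: directory_info_has_arrived */
static int
model_site1_fires(const model_state_t *s)
{
  return s->have_min_dir_info &&
         (s->can_complete_circuit || !s->predicted_circuits);
}

/* Call site #2: run_scheduled_events */
static int
model_site2_fires(const model_state_t *s)
{
  return s->can_complete_circuit || !s->predicted_circuits;
}

/* Call site #3: circuit_send_next_onion_skin, after a multi-hop circuit */
static int
model_site3_fires(const model_state_t *s)
{
  return s->can_complete_circuit;
}

/* Call site #4: circuit_testing_opened, needs an open testing circuit */
static int
model_site4_fires(const model_state_t *s)
{
  return s->testing_circuit_open;
}

/* Run a few rounds and count how many of them start a reachability test. */
int
model_count_reachability_tests(double guard, double mid, double exit_frac)
{
  model_state_t s = { guard, mid, exit_frac, 0, 0, 1, 0 };
  int round, started = 0;

  assert(exit_frac >= 0.0 && exit_frac <= 1.0);
  for (round = 0; round < 5; ++round) {
    model_update_min_dir_info(&s);
    if (s.have_min_dir_info)
      s.can_complete_circuit = 1;  /* ASSUMPTION (c) above */
    if (model_site1_fires(&s) || model_site2_fires(&s) ||
        model_site3_fires(&s) || model_site4_fires(&s)) {
      s.testing_circuit_open = 1;  /* the test launches testing circuits */
      ++started;
    }
  }
  return started;
}

/* Convenience printer for the two scenarios discussed in this mail. */
void
model_print_demo(void)
{
  printf("no exits:   %d tests started\n",
         model_count_reachability_tests(1.0, 1.0, 0.0));
  printf("with exits: %d tests started\n",
         model_count_reachability_tests(1.0, 1.0, 1.0));
}
```

Calling `model_print_demo()` from a `main` shows the deadlock under these assumptions: with 0% exit bandwidth none of the four gates ever opens, so no round starts a test, while with exits present the first gate opens immediately. If assumption (d) is wrong and `any_predicted_circuits` can be false at startup, call site #2 would fire without exits, which is one of the parts I'm not sure about.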