Make tor connection failures random-exponential-backoff

changed milestone to %Tor: 0.2.9.x-final in legacy/trac

added 028-triage in Legacy / Trac SponsorS in Legacy / Trac TorCoreTeam201606 in Legacy / Trac actualpoints::3 in Legacy / Trac component::core tor/tor in Legacy / Trac milestone::Tor: 0.2.9.x-final in Legacy / Trac owner::andrea in Legacy / Trac parent::17293 in Legacy / Trac points::3 in Legacy / Trac priority::high in Legacy / Trac resolution::implemented in Legacy / Trac review-group-3 in Legacy / Trac review-group-4 in Legacy / Trac reviewer::nickm in Legacy / Trac severity::normal in Legacy / Trac sponsor::U-can in Legacy / Trac status::closed in Legacy / Trac tor-dos in Legacy / Trac type::defect in Legacy / Trac labels

Included this in: https://pad.riseup.net/p/deprecating-old-tors2

We should also make protocol failures random-exponential-backoff.

Trac:
Milestone: Tor: 0.2.7.x-final to Tor: 0.2.8.x-final

Trac:
Keywords: N/A deleted, 028-triage added

Bulk-replace SponsorU keyword with SponsorU field.

Trac:
Sponsor: N/A to SponsorU
Keywords: SponsorU deleted, N/A added

Trac:
Points: N/A to medium

It is impossible that we will fix all 226 currently open 028 tickets before 028 releases. Time to move some out. This is my second pass through the "new" and tickets, looking for things to move to 0.2.9.

Trac:
Milestone: Tor: 0.2.8.x-final to Tor: 0.2.9.x-final

Trac:
Sponsor: SponsorU to SponsorU-can

Trac:
Keywords: N/A deleted, tor-dos added

Trac:
Points: medium to 3

Trac:
Parent: legacy/trac#15940 (moved) to legacy/trac#17293 (moved)

Taking ownership for 0.2.9 triage

Trac:
Reviewer: N/A to N/A
Severity: N/A to Normal
Status: new to assigned
Owner: N/A to andrea

Trac:
Keywords: N/A deleted, TorCoreTeam201606 added

Connection failures are detected in connection_handle_read_impl() and connection_handle_write_impl(). Both generically call connection_close_immediate()/connection_mark_for_close_internal(); in addition, they call connection_or_notify_error() for orconns and connection_edge_end_errno() for edge connections.

The connection_close_immediate()/connection_mark_for_close_internal() path flows to connection_about_to_close_connection(), which can call connection_dir_about_to_close(), connection_or_about_to_close(), connection_ap_about_to_close() or connection_exit_about_to_close(). In the case of orconns and edge connections everything interesting happens from connection_or_notify_error() and connection_edge_end_errno(), but connection_dir_about_to_close() is the trigger point for retrying downloads from the directory servers.

Edge connections are either outgoing from the exit, in which case we just send an END cell down the circuit on failure, from connection_edge_end_errno() -> connection_edge_end() -> connection_edge_send_command(), or incoming from the client, in which case we don't get any choices about retrying. There's no retry policy to change there.

Orconn failures cause circuits to die or fail to attach, and these flow through circuit_n_chan_done() and circuit_unlink_all_from_channel() from channel_closed(). Ultimately, connection failures end up in circuit_about_to_free(), and then for origin circuits in circuit_build_failed() when handling a circuit closed for error.

Since all of these possible failure cases are ultimately driven from somewhere else (e.g., exit connection fails) and trigger reporting back to the cause of that connection (e.g. send END cell) rather than retrying, or are on the client side and become a matter of general circuit-building policy, for this ticket I'll be focusing attention on retries of failed downloads from the directory servers. We should think about backoffs for circuit building at some point perhaps, but it seems to be largely separable from the question of directories, less critical for DoS-resistance since there aren't analogous heavily loaded elements like the authorities, and more security-sensitive because of potential implications for behavior when we fail to connect to our preferred entry guard.

Failed directory connections are handled in connection_dir_request_failed(), which calls:

networkstatus_consensus_download_failed() for a consensus
- calls download_status_failed() / update_consensus_networkstatus_downloads()
  - download_status_failed() is a macro for download_status_increment_failure()
connection_dir_download_cert_failed() for a certificate
- calls authority_cert_dl_failed() / update_certificate_downloads()
- this ultimately uses download_status_t too just like the consensus download; see download_status_is_ready_by_sk_in_cl() and friends in routerlist.c
connection_dir_download_routerdesc_failed()

/* No need to relaunch descriptor downloads here: we already do it
 * every 10 or 60 seconds (FOO_DESCRIPTOR_RETRY_INTERVAL) in main.c. */

- The mechanism here is in launch_descriptor_fetches_callback()/reset_descriptor_failures_callback();
  we can realize exponential backoff by suitable adjustments

connection_dir_bridge_routerdesc_failed()
- calls connection_dir_retry_bridges()
  - calls retry_bridge_descriptor_fetch_directly()
    - calls launch_direct_bridge_descriptor_fetch()

At minimum, it should be easy to implement exponential backoffs for consensus and certificate downloads through the download_status_t mechanism, since they already notify it of their successes/failures and ask it whether we're ready to attempt a new download yet. Further investigation of the right approach for the bridge descriptor and router descriptor download cases is pending.
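
To make the idea concrete, here is a minimal sketch of a retry record that is told about successes and failures and is asked whether a new attempt is allowed yet. It is not Tor's download_status_t and not the code in the branch: retry_state_t, next_backoff_delay(), the retry_state_*() helpers, the delay bounds and the exact randomization formula are all assumptions made for illustration, and rand() merely stands in for whatever random source the real code uses.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical per-download retry record (NOT Tor's download_status_t). */
typedef struct {
  unsigned n_failures;     /* consecutive failures so far */
  int last_delay;          /* seconds chosen for the most recent backoff */
  time_t next_attempt_at;  /* earliest time a new attempt is allowed */
} retry_state_t;

#define RETRY_MIN_DELAY 1          /* assumed floor, in seconds */
#define RETRY_MAX_DELAY (60 * 60)  /* assumed cap, in seconds */

/* Return the next delay given the previous one and a random value.
 * The result lies in [last_delay + 1, 3 * last_delay], clamped to
 * RETRY_MAX_DELAY, so the expected delay roughly doubles per failure.
 * (rnd % n is slightly biased; good enough for a sketch.) */
int
next_backoff_delay(int last_delay, unsigned rnd)
{
  if (last_delay < RETRY_MIN_DELAY)
    last_delay = RETRY_MIN_DELAY;
  if (last_delay >= RETRY_MAX_DELAY)
    return RETRY_MAX_DELAY;

  /* Random increment in [1, 2 * last_delay]. */
  int increment = 1 + (int)(rnd % (unsigned)(2 * last_delay));
  int delay = last_delay + increment;
  return delay > RETRY_MAX_DELAY ? RETRY_MAX_DELAY : delay;
}

/* Record a failed attempt and schedule the next one. */
void
retry_state_note_failure(retry_state_t *st, time_t now)
{
  assert(st != NULL);
  st->n_failures++;
  st->last_delay = next_backoff_delay(st->last_delay, (unsigned)rand());
  st->next_attempt_at = now + st->last_delay;
}

/* Record a success: the backoff starts over from scratch. */
void
retry_state_note_success(retry_state_t *st)
{
  assert(st != NULL);
  st->n_failures = 0;
  st->last_delay = 0;
  st->next_attempt_at = 0;
}

/* Return nonzero if a new attempt may be launched at time `now`. */
int
retry_state_is_ready(const retry_state_t *st, time_t now)
{
  assert(st != NULL);
  return now >= st->next_attempt_at;
}
```

A caller would check retry_state_is_ready() before launching a download, then call retry_state_note_failure() or retry_state_note_success() when the attempt finishes. Randomizing the increment, rather than doubling deterministically, keeps many clients that failed at the same moment from retrying in lockstep.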

Seems we're using the download_status_t mechanism for all four types of dirserver downloads:

download_status_t initializers in networkstatus.c control consensus downloads
initialized in download_status_cert_init() of routerlist.c for cert downloads
fetch_status in bridge_info_t of entrynodes.c for bridge descriptors, set up in bridge_add_from_config()
for routerdescs, there's a download_status_t in routerstatus_t, and these are ultimately created by routerstatus_parse_entry_from_string()

Please review implementation in my bug15942 branch; this has been tested by unit tests for the random exponential backoff download schedule in src/test/test_dir.c, and by using iptables to block outgoing TCP connections while bootstrapping a client to observe backoffs in progress.

Trac:
Status: assigned to needs_review

Move some tickets into review-group-3: they are in 0.2.9, and they are needs_review.

Trac:
Keywords: N/A deleted, review-group-3 added

Trac:
Reviewer: N/A to nickm

I tried using gitlab to review your legacy/trac#15942 (moved) branch, at https://gitlab.com/nickm_tor/tor/merge_requests/1 . What do you think?

(Except as noted, the branch lgtm.)

Trac:
Status: needs_review to needs_revision

(oh and also: most of my questions on that branch are actual questions, and not sneaky suggestions. "No, that would be a bad idea" is an okay answer in most cases.)

See rebased and updated bug15942_v2 branch.

Trac:
Status: needs_revision to needs_review

okay, now I get it.

But now that I look at it, I wonder whether the shift-trick isn't a little too clever. What if we just did it like this bug15942_v2_alternative ?

Trac:
Keywords: N/A deleted, review-group-4 added

Seems reasonable as long as entropy is sufficiently cheap; I'm fine with that alternative I think.

Okay. Merged it. Let's see how it works out! (Please set the "actual points" field to the approx number of coding days you spent on this.)

Trac:
Status: needs_review to closed
Resolution: N/A to implemented

Trac:
Actualpoints: N/A to 3

closed

changed time estimate to 24h

added 24h of time spent

mentioned in issue legacy/trac#19377 (moved)

mentioned in issue legacy/trac#20534 (moved)

mentioned in issue legacy/trac#22355 (moved)

moved from legacy/trac#15942 (moved)

added Bug label and removed 1 deleted label

added 1 deleted label and removed 1 deleted label

added DoS label and removed 1 deleted label
