clients not detecting stale onion service introduction points
Tor version 0.3.2.10 running on Debian stretch;
The Tor client runs continuously over many days. Periodically (daily in this case), an HTTP client reaches out to the same v2 onion service via Tor. Normally, this works correctly.
However, sometimes (no more than about 2/5 of the days), the HTTP client will fail to connect properly to the onion service, and its TCP connection will simply time out. Restarting Tor almost always solves this problem, although automating a reactive Tor restart is undesirable, particularly when doing so closes all of the circuits of all other client applications using the same Tor.
Looking closely, we observe that in such cases the Tor client is not able to complete a rendezvous with the onion service. Specifically, the Tor client reaches out to the same intro point many times in rapid succession, without ever managing to establish a connection. I surmise that a workaround might entail asking Tor to clear its cached onion service descriptor for this particular v2 service. However, the code to handle client connections to onion services in hs_client.c
intends to handle introduction point failures, so we should really determine why the many attempts to connect to the same introduction point are not logged as a failure that would ultimately lead to fetching the server descriptor again.
My logs show that during a typical onion service request, my Tor attempted to connect to the same introduction point, 176.36.20.10, a total of 47 times in the 112-second interval between 06:52:06 and 06:53:58 UTC.
Apr 21 06:53:55.000 [debug] relay_send_command_from_edge_(): delivering 33 cell forward.
Apr 21 06:53:55.000 [debug] relay_send_command_from_edge_(): Sending a RELAY_EARLY cell; 4 remaining.
Apr 21 06:53:55.000 [debug] circuit_package_relay_cell(): encrypting a layer of the relay cell.
Apr 21 06:53:55.000 [debug] circuit_package_relay_cell(): encrypting a layer of the relay cell.
Apr 21 06:53:55.000 [debug] circuit_package_relay_cell(): encrypting a layer of the relay cell.
Apr 21 06:53:55.000 [debug] append_cell_to_circuit_queue(): Made a circuit active.
Apr 21 06:53:55.000 [debug] scheduler_channel_has_waiting_cells(): Channel 64 at 0x1c12450 went from waiting_for_cells to pending
Apr 21 06:53:55.000 [info] circuit_get_best(): There is an intro circuit being created right now, but it has already taken quite a while. Starting one in parallel.
Apr 21 06:53:55.000 [info] circuit_get_open_circ_or_launch(): Chose $04D962C6155BFC3705DC3 8699D4E6B3CE1524AE7~$04D962C6155BFC3705 at 176.36.20.10 as intro point for '[scrubbed]'.
Apr 21 06:53:55.000 [debug] circuit_find_to_cannibalize(): Hunting for a circ to cannibalize: purpose 6, uptime 0, capacity 1, internal 1
Apr 21 06:53:55.000 [info] circuit_launch_by_extend_info(): Cannibalizing circ 3202873042 (id: 457) for purpose 6 (Hidden service client: Connecting to intro point)
Apr 21 06:53:55.000 [debug] circuit_change_purpose(): changing purpose of origin circ 457 from "General-purpose client" (5) to "Hidden service client: Connecting to intro point" (6)
Apr 21 06:53:55.000 [debug] circuit_send_intermediate_onion_skin(): starting to send subsequent skin.
Apr 21 06:53:55.000 [info] circuit_send_intermediate_onion_skin(): Sending extend relay cell.
Apr 21 06:53:55.000 [debug] relay_send_command_from_edge_(): delivering 6 cell forward.
Apr 21 06:53:55.000 [debug] relay_send_command_from_edge_(): Sending a RELAY_EARLY cell; 4 remaining.
Apr 21 06:53:55.000 [debug] circuit_package_relay_cell(): encrypting a layer of the relay cell.
Apr 21 06:53:55.000 [debug] circuit_package_relay_cell(): encrypting a layer of the relay cell.
Apr 21 06:53:55.000 [debug] circuit_package_relay_cell(): encrypting a layer of the relay cell.
Apr 21 06:53:55.000 [debug] append_cell_to_circuit_queue(): Made a circuit active.
Apr 21 06:53:55.000 [info] rep_hist_note_used_internal(): New port prediction added. Will continue predictive circ building for 2522 more seconds.
Apr 21 06:53:55.000 [info] connection_ap_handshake_attach_circuit(): Intro 3202873042 (id: 457) and rend circuit 2500882674 (id: 483) circuits are not both ready. Stalling conn. (109 sec old)
Apr 21 06:53:55.000 [info] circuit_predict_and_launch_new(): Have 12 clean circs (0 uptime-internal, 12 internal), need another hidden service circ.
Note that the same introduction point is chosen each time, and Tor never decides that the introduction point is invalid. Suggest that this may lead Tor to not request a new onion service descriptor. (It is not clear to me that the "interesting" AS in which the introduction point resides (39608, in the United Arab Emirates) is important in this case.) Suggest that the problem may be a matter of abstraction layers, e.g. Tor invalidates introduction points at a higher layer but retries at a lower layer such that the retries are opaque to the invalidation process.
This problem should be considered important, as it may lead to long-lived Tor clients becoming unable to connect to long-lived onion services, not to mention Type I errors in the systematic assessment of whether a given onion service is down.