Potential issue with rend cache object when intro points fall to 0.
(Reproduced on Tor v0.2.6.1-alpha-dev (git-a142fc29))
Here is the use case I was testing. I set up an HS on a remote server for perf analysis. On my client, I wrote a small script that uses torsocks to open 10 connections to that HS, each on a different circuit (assuming that the SOCKS5 user/pass == unique circuit feature works).
With the above, one time out of 10, all 10 connections successfully connect and work. The rest of the time, an arbitrary number of connections fail with "Host unreachable". I suspect this is a mix of luck in some runs and the real issue in others.
I analyzed this, and my understanding is that the rend cache contains the v2 descriptor with its stored intro points (the "intro_nodes" variable). However, over the cycle of trying to connect, some intro points may be unreachable and are therefore removed from that list. It also appears that nodes can be removed from that list when closing circuits that were built in "parallel":
Nov 04 15:36:08.000 [info] rend_client_close_other_intros(): Closing introduction circuit 25 that we built in parallel (Purpose 7).
Nov 04 15:36:08.000 [debug] circuit_get_by_circid_channel_impl(): circuit_get_by_circid_channel_impl() returning circuit 0x7f6f1a171190 for circ_id 2434373038, channel ID 0 (0x7f6f1a0425e0)
Nov 04 15:36:08.000 [info] circuit_mark_for_close_(): Failed intro circ rejxmpqgho5vqdl4 to $EBE718E1A49EE229071702964F8DB1F318075FF8 (awaiting ack). Removing from descriptor.
circuit_mark_for_close_() triggers an INTRO_POINT_FAILURE_GENERIC failure that removes the intro point from the list. I might be misinterpreting the "we built in parallel" feature, but what I can observe is that the intro node list becomes empty at some point, which triggers a "let's refetch that v2 descriptor!" behaviour.
Nov 04 15:36:08.000 [info] rend_client_report_intro_point_failure(): Unknown service "rejxmpqgho5vqdl4". Re-fetching descriptor.
However, the rend cache is not cleared of the old entry before that new descriptor is fetched. So once the v2 descriptor is received, we store it in the cache using "rend_cache_store_v2_desc_as_client()", which prints this:
Nov 04 15:36:09.000 [info] rend_cache_store_v2_desc_as_client(): We already have this service descriptor rejxmpqgho5vqdl4. [rendezvous-service-descriptor i7hkcux5dghqv6ahstewyccltr6aud2x
So since we "have it" in the cache, we call "rend_client_desc_trynow()", which fails completely because all intro points in the cached object are gone, and this closes all pending connections.
Now, I think this happens because the heuristic for deciding "we already have the cache object" is just a comparison of the "desc" string, here in rendcommon.c +1156:
  /* Do we already have this descriptor? */
  if (e && !strcmp(desc, e->desc)) {
    log_info(LD_REND,"We already have this service descriptor %s. [%s]",
             safe_str_client(service_id), desc);
    e->received = time(NULL);
    goto okay;
  }
I think when the intro point list drops to 0 nodes, we should remove the entry from the cache and trigger the "fetch it again" path.
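To make the sequence concrete, here is a toy model of it in standalone C. This is not Tor code: toy_cache_entry, toy_store() and toy_intro_point_failed() are hypothetical names made up for illustration, and the entry is reduced to the descriptor string plus a count of remaining intro points. It only sketches the idea that, if the entry is evicted when the count hits 0, a refetched descriptor with identical text is no longer short-circuited by the strcmp() check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a client-side rend cache entry. */
typedef struct toy_cache_entry {
  char *desc;        /* encoded descriptor text; NULL means "not cached" */
  int n_intro_nodes; /* intro points still usable in the cached copy */
} toy_cache_entry;

typedef enum { TOY_STORE_ALREADY_HAVE, TOY_STORE_REPLACED } toy_store_result;

/* Copy a string onto the heap (plain C has no portable strdup). */
static char *toy_strdup(const char *s)
{
  size_t len = strlen(s) + 1;
  char *copy = malloc(len);
  assert(copy != NULL);
  memcpy(copy, s, len);
  return copy;
}

/* Mirrors the check quoted above: identical text means "already have",
 * and the cached intro point count is left untouched. */
static toy_store_result toy_store(toy_cache_entry *e, const char *desc,
                                  int n_intro_in_desc)
{
  assert(e != NULL && desc != NULL);
  if (e->desc && !strcmp(desc, e->desc))
    return TOY_STORE_ALREADY_HAVE;
  free(e->desc);
  e->desc = toy_strdup(desc);
  e->n_intro_nodes = n_intro_in_desc;
  return TOY_STORE_REPLACED;
}

/* Drop one intro point after a failure. Returns true when none are left,
 * i.e. when the caller would refetch the descriptor. With evict_when_empty
 * set (the behaviour suggested above), the stale entry is removed too. */
static bool toy_intro_point_failed(toy_cache_entry *e, bool evict_when_empty)
{
  assert(e != NULL);
  if (e->n_intro_nodes > 0)
    e->n_intro_nodes--;
  if (e->n_intro_nodes > 0)
    return false;
  if (evict_when_empty) {
    free(e->desc);
    e->desc = NULL;
  }
  return true;
}
```

Without eviction, storing the same descriptor text again after all intro points have failed reports "already have" and leaves the count at 0, which matches the log above. With eviction, the same store replaces the entry and restores the intro points from the fresh descriptor.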