If a tor client gets 7 SOCKS connection requests to a hidden service with an uncached descriptor in rapid succession, it launches an HSDir request for each connection. It doesn't wait for the circuit to any of the HSDirs to be built, or wait for a response from any of the HSDirs.
After 6 connections, it fails on the 7th attempted fetch, because it has tried to (rapidly) fetch the descriptor 6 times, and hasn't got it yet. It then fails each of the outstanding SOCKS requests 1-7.
It can do this all in the same second, before any circuits have a chance to be built.
The client then backs off after the descriptor fetch failure, the circuit build attempts succeed, and then the 8th and subsequent requests succeed.
This behaviour is unlikely to be triggered by HTML-based hidden services. There is typically 1 connection with an uncached descriptor to load the initial HTML page; further connections, with a now-cached descriptor, load any page resources.
Observed in 0.2.7.0-dev.
tor logs are attached for 3 HSDirs, 1 client, and 1 HS in a chutney network hs-100-clients (a chutney branch with these changes is upcoming in #15936 (moved)).
On second thoughts, there is one scenario where TorBrowser could trigger this behavior on a system with a fast CPU and slow network link:
1. Bookmark the same onion site 7 times in a bookmarks folder.
2. Wait until the cached HS entry expires.
3. Click "Open All In Tabs".
If TorBrowser opens the tabs fast enough, it could trigger tor to launch 6 requests and then fail on the 7th, while still waiting for the other 6 to connect.
This issue has also been logged as #16501 (closed), in the following scenario:
> I have a small threaded program that connects to the tor socks port and issues a GET request on hidden services. I noticed that if it makes 7 concurrent calls to one hidden service, then all of the requests will fail immediately.
#16501 (closed) also has further log details and the same diagnosis I provided above. It's been closed as a duplicate.
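For illustration, a reproducer in that spirit might look like the sketch below, written in C with libcurl (the onion address is a placeholder and tor's SOCKS port is assumed to be at 127.0.0.1:9050; the reporter's actual program wasn't attached):
{{{
/* Hypothetical reproducer sketch, not the reporter's program: issue 7
 * concurrent GETs to one onion service through tor's SOCKS port.
 * Build with: cc repro.c -lcurl -lpthread */
#include <curl/curl.h>
#include <pthread.h>
#include <stdio.h>

#define N_REQUESTS 7

static void *fetch(void *arg)
{
  (void)arg;
  CURL *curl = curl_easy_init();
  if (!curl)
    return NULL;
  /* The socks5h:// scheme makes curl resolve the .onion name via tor. */
  curl_easy_setopt(curl, CURLOPT_PROXY, "socks5h://127.0.0.1:9050");
  curl_easy_setopt(curl, CURLOPT_URL, "http://someaddress.onion/");
  CURLcode rc = curl_easy_perform(curl);
  fprintf(stderr, "request finished: %s\n", curl_easy_strerror(rc));
  curl_easy_cleanup(curl);
  return NULL;
}

int main(void)
{
  pthread_t threads[N_REQUESTS];
  curl_global_init(CURL_GLOBAL_DEFAULT);
  for (int i = 0; i < N_REQUESTS; i++)
    pthread_create(&threads[i], NULL, fetch, NULL);
  for (int i = 0; i < N_REQUESTS; i++)
    pthread_join(threads[i], NULL);
  curl_global_cleanup();
  return 0;
}
}}}
With the bug present, all 7 requests fail at once; with the fix, later requests should wait on the pending descriptor fetch instead.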
It sounds like the Tor client isn't properly checking to see if there is already an hsdesc fetch in progress.
(If there is, the right behavior is to put this stream into whatever state it is that pending streams go into such that when the hsdesc finishes, it notifies all streams in that state so they can move forward.)
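A minimal sketch of that check, assuming hypothetical helper names (rend_client_desc_trynow() and the RENDDESC_WAIT state are real; the rest is made up for illustration):
{{{
/* Sketch only: hs_desc_fetch_in_progress(), mark_stream_waiting() and
 * launch_new_hs_desc_fetch() are hypothetical names, not tor's real API. */
static void
handle_new_onion_stream(connection_t *conn, const char *onion_address)
{
  if (hs_desc_fetch_in_progress(onion_address)) {
    /* A fetch is already in flight: park this stream in the
     * RENDDESC_WAIT state instead of launching another fetch;
     * rend_client_desc_trynow() wakes all such streams when the
     * descriptor arrives. */
    mark_stream_waiting(conn);
    return;
  }
  launch_new_hs_desc_fetch(onion_address);
}
}}}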
dgoulet was working on this, but I'm not sure where he got up to.
I can confirm it's still an issue in the latest git sources: I triggered this bug using src/test/test-network.sh --connections 10 --flavour hs --sleep 30 while testing the chutney performance measurement changes in #14175 (moved).
Hrm ok it's a fun puzzle but I think I figured it out and the solution could be simple.
In rend_client_refetch_v2_renddesc(), at the bottom, if the fetch request fails because no new HSDir can be found (all 6 have already been queried), we call rend_client_desc_trynow(), which closes all RENDDESC_WAIT connections for that .onion, thus losing all 6 previous requests.
I can't figure out why we do that if we can't find an HSDir... so can we simply remove this?:
{{{
if (ret <= 0) {
  /* Close pending connections on error or if no hsdir can be found. */
  rend_client_desc_trynow(rend_query->onion_address);
}
}}}
When the fetch succeeds (in connection_dir_client_reached_eof()), we already call that function to move to the next stage for HS connection so I propose we remove the above.
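Expressed as a rough patch sketch (assuming this code lives in src/or/rendclient.c; the hunk context is approximate):
{{{
--- a/src/or/rendclient.c
+++ b/src/or/rendclient.c
@@ rend_client_refetch_v2_renddesc() @@
-  if (ret <= 0) {
-    /* Close pending connections on error or if no hsdir can be found. */
-    rend_client_desc_trynow(rend_query->onion_address);
-  }
}}}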
Trac: Status: assigned to needs_information; Severity: N/A to Normal
Makes sense to me, but do we limit the number of pending connections to one per HSDir?
And do we retry when the connections actually fail?
(I don't know this code very well. I am concerned that we could end up asking a HSDir multiple times, or end up stalling if the network is slow when we first try all 6, but speeds up later.)
> Makes sense to me, but do we limit the number of pending connections to one per HSDir?

One SOCKS connection can query all 6 HSDirs.

> And do we retry when the connections actually fail?

There is a "cache" of last-queried HSDirs, meaning we don't query an HSDir more than once for the same request. Once the descriptor arrives or the SOCKS connection ends, the cache is purged for the requested onion.

> (I don't know this code very well. I am concerned that we could end up asking a HSDir multiple times, or end up stalling if the network is slow when we first try all 6, but speeds up later.)

See last_hid_serv_requests in rendclient.c for more info.
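For illustration only, the general shape of that per-request dedup cache, as a self-contained sketch (the names and layout here are not the real last_hid_serv_requests code):
{{{
#include <string.h>
#include <stdbool.h>

#define MAX_HSDIRS 6

/* Tracks which HSDirs have been asked for one onion address. */
struct hs_fetch_attempt {
  const char *onion_address;
  const char *queried[MAX_HSDIRS]; /* HSDir identities already asked */
  int n_queried;
};

/* Returns false if this HSDir was already queried for this attempt,
 * so the caller knows to pick a different HSDir instead. */
static bool mark_hsdir_queried(struct hs_fetch_attempt *a,
                               const char *hsdir_id)
{
  for (int i = 0; i < a->n_queried; i++)
    if (!strcmp(a->queried[i], hsdir_id))
      return false; /* never ask the same HSDir twice per request */
  if (a->n_queried < MAX_HSDIRS)
    a->queried[a->n_queried++] = hsdir_id;
  return true;
}

/* Called when the descriptor arrives or the SOCKS request ends,
 * matching the purge behaviour described above. */
static void purge_attempt(struct hs_fetch_attempt *a)
{
  a->n_queried = 0;
}
}}}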
> This behaviour is unlikely to be triggered by HTML-based hidden services.
Unless you're performing an attack. Should further tickets like this one be filed to make other attacks easier?
Tested, and the 7th connection no longer kills all the others. Easy test:
{{{
for i in `seq 1 7`; do torsocks wget http://someaddress.onion & done
}}}
Note that once the descriptor arrives, the 5 other HSDir connections are not terminated immediately. It doesn't cause a bug or anything, but letting them run to completion may put unnecessary load on the network. I'll open a bug later about that.
hm, that seems easy. Thanks also for the comment. Two questions:
Is there any reason that rend_client_fetch_v2_desc() could fail other than already having 6 directory connections? If so, the check should turn into if (ret < 0), right?
Yes, please open that other bug.
Do we really want to open a number of directory fetches that depends on how many client requests we got at once? That behavior seems kind of weird. If you agree, please open a bug for 0.2.9?
> Is there any reason that rend_client_fetch_v2_desc() could fail other than already having 6 directory connections? If so, the check should turn into if (ret < 0), right?

Yes, it can fail for other reasons, but since rend_client_refetch_v2_renddesc() is a void function, there is not much we can do if it fails. The appropriate log messages will still be emitted.

> Yes, please open that other bug.

On my stack.

> Do we really want to open a number of directory fetches that depends on how many client requests we got at once? That behavior seems kind of weird. If you agree, please open a bug for 0.2.9?

Indeed. We shouldn't make that many requests (unless some historical reason made us do it). I think it should be only 2 or 3 in parallel (at most half), and once we have the descriptor, close the pending ones. That's actually the behavior with introduction points.
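For what it's worth, a sketch of what that cap might look like (the helper name and the constant are hypothetical, not tor's real code):
{{{
/* Sketch only: cap parallel HSDir fetches per onion address instead of
 * launching one fetch per SOCKS request. count_inflight_hsdir_fetches()
 * is a hypothetical helper. */
#define MAX_PARALLEL_HSDIR_FETCHES 3

if (count_inflight_hsdir_fetches(onion_address) >=
    MAX_PARALLEL_HSDIR_FETCHES) {
  return; /* enough fetches in flight; let the pending ones answer */
}
/* ... otherwise pick an unqueried HSDir and launch a fetch ... */
}}}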
I'll open a bug for 0.2.9, and if it's not what we want in the end, we'll comment there. Thx!