If a tor client gets 7 SOCKS connection requests to a hidden service with an uncached descriptor in rapid succession, it launches an HSDir request for each connection. It doesn't wait for the circuit to any of the HSDirs to be built, or wait for a response from any of the HSDirs.
After 6 connections, it fails on the 7th attempted fetch, because it has tried to (rapidly) fetch the descriptor 6 times, and hasn't got it yet. It then fails each of the outstanding SOCKS requests 1-7.
It can do this all in the same second, before any circuits have a chance to be built.
The client then backs off after the descriptor fetch failure, the circuit build attempts succeed, and then the 8th and subsequent requests succeed.
This behaviour is unlikely to be triggered by HTML-based hidden services. There is typically 1 connection with an uncached descriptor to load the initial HTML page; further connections, with a now-cached descriptor, load any page resources.
Observed in 0.2.7.0-dev.
tor logs are attached for 3 HSDirs, 1 client, and 1 HS in a chutney network hs-100-clients (a chutney branch with these changes is upcoming in #15936 (moved)).
On second thoughts, there is one scenario where TorBrowser could trigger this behavior on a system with a fast CPU and slow network link:
1. Bookmark the same onion site 7 times in a bookmarks folder.
2. Wait until the cached HS entry expires.
3. Click "Open All In Tabs".
If TorBrowser opens the tabs fast enough, it could trigger tor to launch 6 requests and then fail on the 7th, while still waiting for the other 6 to connect.
This issue has also been logged as #16501 (closed), in the following scenario:
> I have a small threaded program that connects to the tor socks port and issues a GET request on hidden services. I noticed that if it makes 7 concurrent calls to one hidden service, then all of the requests will fail immediately.
#16501 (closed) also has further log details and the same diagnosis I provided above. It's been closed as a duplicate.
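For illustration, a reproducer in that spirit might look like the sketch below, written in C with libcurl (the onion address is a placeholder and tor's SOCKS port is assumed to be at 127.0.0.1:9050; the reporter's actual program wasn't attached):
{{{
/* Hypothetical reproducer sketch, not the reporter's program: issue 7
 * concurrent GETs to one onion service through tor's SOCKS port.
 * Build with: cc repro.c -lcurl -lpthread */
#include <curl/curl.h>
#include <pthread.h>
#include <stdio.h>

#define N_REQUESTS 7

static void *fetch(void *arg)
{
  (void)arg;
  CURL *curl = curl_easy_init();
  if (!curl)
    return NULL;
  /* The socks5h:// scheme makes curl resolve the .onion name via tor. */
  curl_easy_setopt(curl, CURLOPT_PROXY, "socks5h://127.0.0.1:9050");
  curl_easy_setopt(curl, CURLOPT_URL, "http://someaddress.onion/");
  CURLcode rc = curl_easy_perform(curl);
  fprintf(stderr, "request finished: %s\n", curl_easy_strerror(rc));
  curl_easy_cleanup(curl);
  return NULL;
}

int main(void)
{
  pthread_t threads[N_REQUESTS];
  curl_global_init(CURL_GLOBAL_DEFAULT);
  for (int i = 0; i < N_REQUESTS; i++)
    pthread_create(&threads[i], NULL, fetch, NULL);
  for (int i = 0; i < N_REQUESTS; i++)
    pthread_join(threads[i], NULL);
  curl_global_cleanup();
  return 0;
}
}}}
With the bug present, all 7 requests fail at once; with the fix, later requests should wait on the pending descriptor fetch instead.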
It sounds like the Tor client isn't properly checking to see if there is already an hsdesc fetch in progress.
(If there is, the right behavior is to put this stream into whatever state it is that pending streams go into such that when the hsdesc finishes, it notifies all streams in that state so they can move forward.)
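A minimal sketch of that check, assuming hypothetical helper names (rend_client_desc_trynow() and the RENDDESC_WAIT state are real; the rest is made up for illustration):
{{{
/* Sketch only: hs_desc_fetch_in_progress(), mark_stream_waiting() and
 * launch_new_hs_desc_fetch() are hypothetical names, not tor's real API. */
static void
handle_new_onion_stream(connection_t *conn, const char *onion_address)
{
  if (hs_desc_fetch_in_progress(onion_address)) {
    /* A fetch is already in flight: park this stream in the
     * RENDDESC_WAIT state instead of launching another fetch;
     * rend_client_desc_trynow() wakes all such streams when the
     * descriptor arrives. */
    mark_stream_waiting(conn);
    return;
  }
  launch_new_hs_desc_fetch(onion_address);
}
}}}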
dgoulet was working on this, but I'm not sure where he got up to.
I can confirm it's still an issue in the latest git sources: I triggered this bug using src/test/test-network.sh --connections 10 --flavour hs --sleep 30 while testing the chutney performance measurement changes in #14175 (moved).
Hrm ok it's a fun puzzle but I think I figured it out and the solution could be simple.
In rend_client_refetch_v2_renddesc(), at the bottom, if the fetch request fails because no new HSDir can be found (all 6 have already been queried), we call rend_client_desc_trynow(), which closes all RENDDESC_WAIT connections for that .onion, thus losing all 6 previous requests.
I can't figure out why we do that if we can't find an HSDir... so can we simply remove this?:
{{{
if (ret <= 0) {
  /* Close pending connections on error or if no hsdir can be found. */
  rend_client_desc_trynow(rend_query->onion_address);
}
}}}
When the fetch succeeds (in connection_dir_client_reached_eof()), we already call that function to move to the next stage for HS connection so I propose we remove the above.
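Expressed as a rough patch sketch (assuming this code lives in src/or/rendclient.c; the hunk context is approximate):
{{{
--- a/src/or/rendclient.c
+++ b/src/or/rendclient.c
@@ rend_client_refetch_v2_renddesc() @@
-  if (ret <= 0) {
-    /* Close pending connections on error or if no hsdir can be found. */
-    rend_client_desc_trynow(rend_query->onion_address);
-  }
}}}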
Trac: Status: assigned to needs_information; Severity: N/A to Normal
Makes sense to me, but do we limit the number of pending connections to one per HSDir?
And do we retry when the connections actually fail?
(I don't know this code very well. I am concerned that we could end up asking a HSDir multiple times, or end up stalling if the network is slow when we first try all 6, but speeds up later.)
> Makes sense to me, but do we limit the number of pending connections to one per HSDir?

One SOCKS connection can query all 6 HSDirs.

> And do we retry when the connections actually fail?

There is a "cache" of last-queried HSDirs, meaning we don't query an HSDir more than once for the same request. Once the descriptor arrives or the SOCKS connection ends, the cache is purged for the requested onion.

> (I don't know this code very well. I am concerned that we could end up asking a HSDir multiple times, or end up stalling if the network is slow when we first try all 6, but speeds up later.)

See last_hid_serv_requests in rendclient.c for more info.
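For illustration only, the general shape of that per-request dedup cache, as a self-contained sketch (the names and layout here are not the real last_hid_serv_requests code):
{{{
#include <string.h>
#include <stdbool.h>

#define MAX_HSDIRS 6

/* Tracks which HSDirs have been asked for one onion address. */
struct hs_fetch_attempt {
  const char *onion_address;
  const char *queried[MAX_HSDIRS]; /* HSDir identities already asked */
  int n_queried;
};

/* Returns false if this HSDir was already queried for this attempt,
 * so the caller knows to pick a different HSDir instead. */
static bool mark_hsdir_queried(struct hs_fetch_attempt *a,
                               const char *hsdir_id)
{
  for (int i = 0; i < a->n_queried; i++)
    if (!strcmp(a->queried[i], hsdir_id))
      return false; /* never ask the same HSDir twice per request */
  if (a->n_queried < MAX_HSDIRS)
    a->queried[a->n_queried++] = hsdir_id;
  return true;
}

/* Called when the descriptor arrives or the SOCKS request ends,
 * matching the purge behaviour described above. */
static void purge_attempt(struct hs_fetch_attempt *a)
{
  a->n_queried = 0;
}
}}}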
> This behaviour is unlikely to be triggered by HTML-based hidden services.
Unless you're performing an attack. Should further tickets like this one be filed to make other attacks easier?
Tested, and the 7th connection no longer kills all the others. Easy test:
{{{
for i in `seq 1 7`; do torsocks wget http://someaddress.onion & done
}}}
Note that once the descriptor arrives, the 5 other HSDir connections are not terminated immediately. It doesn't cause a bug or anything, but letting them run to completion may put unnecessary load on the network. I'll open a bug later about that.
hm, that seems easy. Thanks also for the comment. Two questions:
Is there any reason that rend_client_fetch_v2_desc() could fail other than already having 6 directory connections? If so, the check should turn into if (ret < 0), right?
Yes, please open that other bug.
Do we really want to open a number of directory fetches that depends on how many client requests we got at once? That behavior seems kind of weird. If you agree, please open a bug for 0.2.9?
> Is there any reason that rend_client_fetch_v2_desc() could fail other than already having 6 directory connections? If so, the check should turn into if (ret < 0), right?

Yes, it can fail for other reasons, but since rend_client_refetch_v2_renddesc() is a void function, there is not much we can do if it fails. The appropriate log messages will still be emitted.

> Yes, please open that other bug.

On my stack.

> Do we really want to open a number of directory fetches that depends on how many client requests we got at once? That behavior seems kind of weird. If you agree, please open a bug for 0.2.9?

Indeed. We shouldn't make that many requests (unless some historical reason made us do it). I think it should be only 2 or 3 in parallel (at most half), and once we have the descriptor, close the pending ones. That's actually the behavior with introduction points.
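For what it's worth, a sketch of what that cap might look like (the helper name and the constant are hypothetical, not tor's real code):
{{{
/* Sketch only: cap parallel HSDir fetches per onion address instead of
 * launching one fetch per SOCKS request. count_inflight_hsdir_fetches()
 * is a hypothetical helper. */
#define MAX_PARALLEL_HSDIR_FETCHES 3

if (count_inflight_hsdir_fetches(onion_address) >=
    MAX_PARALLEL_HSDIR_FETCHES) {
  return; /* enough fetches in flight; let the pending ones answer */
}
/* ... otherwise pick an unqueried HSDir and launch a fetch ... */
}}}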
I'll open a bug for 0.2.9, and if it's not what we want in the end, we'll comment there. Thx!