Off-by-2 error counting to MAX_REND_FAILURES for onion services:
In can_relaunch_service_rendezvous_point(), we check
if (circ->build_state->failure_count > MAX_REND_FAILURES || circ->build_state->expiry_time <= time(NULL)) {
to decide whether to abort the relaunch.
But failure_count is only incremented during the relaunch, i.e. after this check runs. Also, the comparison uses ">" rather than ">=".
Yet the definition of MAX_REND_FAILURES is
/** How many times will a hidden service operator attempt to connect to a
 * requested rendezvous point before giving up? */
#define MAX_REND_FAILURES 1
So when the first attempt fails, failure_count is 0, which is not > 1, so we try again. When the second attempt fails, failure_count is 1, which is also not > 1. It's only after the third attempt fails that we decide that MAX_REND_FAILURES has been reached.
This bug affects legacy onion services, and it will also affect nextgen onion services once legacy/trac#24894 (moved) is fixed.
I made a bug24895 branch in my arma repository. It's based on maint-0.2.9, since I think fixing it that far back is wise given that 0.2.9 will remain LTS for so long.
There will probably be some simple conflicts as we forward-port, but it's a one-liner plus a comment so hopefully it won't be too bad.
Really, I want to change the logic so that we increment the failure count earlier in the function, but I think that refactoring should wait until 0.3.3.
Stay tuned for a revised branch where I propose that we make it into a consensus param, which defaults to 2 but which we set to 1 in the consensus for the next while.
Datapoint: in months of running v2 and v3 onions, I've only twice seen a relaunch of a rendezvous point circuit. As long as the Guard can keep up with the service's circuit creation, it is something I've rarely seen fail.
OK, agreed: two tries to reach the RP is what I think we should have in normal circumstances, not just the 1 in the current patch. I like the idea of a consensus parameter so we can adjust depending on network load.
But for the current network situation, I think we want to bring it down to 1 for now: right now, 1 million clients introducing would induce two million circuits from the services they are trying to reach. At that scale, I'm ready to call it an amplification attack vector.
My bug24895-029 branch adds this consensus param, and applies to maint-0.2.9. (I don't think any relevant huge services still run on 0.2.5, so I am fine leaving oldoldstable alone here.)
I picked a max of 10 (out of my hat, "chosen by fair dice roll"), and a default of 2 (based on discussion above). I could maybe have picked a default of 1, since we're going to deploy 1 on the network as soon as this patch goes out, and we have no plans to change it from 1. But I went with 2 in case we later decide to try it.
I also went with the consensus param name "maxrendfailures", which is maybe a bit vague because it doesn't specify whether I mean the client side or the service side. If somebody likes a different size bike shed, now is the time to speak up. :)
This looks good! OK, so I made two tiny changes here, and I'll explain why. They are in the fixup commit 9e3b63b041f0e60e:
I renamed maxrendfailures to hs_service_max_rdv_failures, for several reasons. First, as you said above, it makes the param explicitly service-side. Second, because this needs to be merged forward to v3 in 032, the value is used by both versions and moves to hs_common.h, so the name should clearly namespace it to the "hs" subsystem and identify it as service-side. Third, "rend" in the code is associated with v2, hence "rdv" instead. Finally, I think a name like this will be more explicit when a dirauth reads it from the consensus or a torrc file.
I simply made the default, min, and max values #defines which, when merging forward, will move to hs_common.c.
This patch won't merge forward cleanly because of the big v3 change that started in 031, so here are the respective branches per version.
NOTE: the fixup commit from above is in all the branches, so --autosquash might be desirable before merging. I hope I got the merge forward right... let me know if anything goes wrong. There will be one very easy conflict from 029 to 030.
Branches: ticket24895_029_01, ticket24895_031_01, and ticket24895_032_01
Spec branch: ticket24895_01