Off-by-2 error counting to MAX_REND_FAILURES for onion services:
In can_relaunch_service_rendezvous_point(), we check
if (circ->build_state->failure_count > MAX_REND_FAILURES || circ->build_state->expiry_time <= time(NULL)) {
to decide whether to abort the relaunch.
But failure_count is only incremented during the relaunch, i.e. after this check runs. Also, the comparison uses ">" rather than ">=".
Yet the definition of MAX_REND_FAILURES is
/** How many times will a hidden service operator attempt to connect to a
 * requested rendezvous point before giving up? */
#define MAX_REND_FAILURES 1
So when the first attempt fails, failure_count is 0, which is not > 1, so we try again. When the second attempt fails, failure_count is 1, which is also not > 1. It's only after the third attempt fails that we decide that MAX_REND_FAILURES has been reached.
This bug affects legacy onion services, and it will also affect nextgen onion services once legacy/trac#24894 (moved) is fixed.
I made a bug24895 branch in my arma repository. It's based on maint-0.2.9, since I think fixing it that far back is wise given that 0.2.9 will remain LTS for so long.
There will probably be some simple conflicts as we forward-port, but it's a one-liner plus a comment so hopefully it won't be too bad.
Really, I want to change the logic so that we increment the failure count earlier in the function, but I think that refactoring should wait until 0.3.3.
Stay tuned for a revised branch where I propose that we make it into a consensus param, which defaults to 2 but which we set to 1 in the consensus for the next while.
Datapoint: in months of running v2 and v3 onions, I've only twice seen a relaunch of a rendezvous point circuit. As long as the Guard can keep up with the service's circuit creation, it is something I've rarely seen fail.
OK, agreed: two tries to reach the RP is what I think we should have in normal circumstances, not just the 1 in the current patch. I like the idea of a consensus parameter so we can adjust depending on network load.
But for the current network situation, I think we want to bring it down to 1 for now: right now, 1 million clients introducing would induce two million circuits from the services they are trying to reach. At that scale, I'm ready to call it an amplification attack vector.
My bug24895-029 branch adds this consensus param, and applies to maint-0.2.9. (I don't think any relevant huge services still run on 0.2.5, so I am fine leaving oldoldstable alone here.)
I picked a max of 10 (out of my hat, "chosen by fair dice roll"), and a default of 2 (based on discussion above). I could maybe have picked a default of 1, since we're going to deploy 1 on the network as soon as this patch goes out, and we have no plans to change it from 1. But I went with 2 in case we later decide to try it.
I also went with the consensus param name "maxrendfailures", which is maybe a bit vague because it doesn't specify whether I mean the client side or the service side. If somebody likes a different size bike shed, now is the time to speak up. :)
This looks good! OK, so I made two tiny changes here, and I'll explain why. They are in the fixup commit 9e3b63b041f0e60e:
I renamed maxrendfailures to hs_service_max_rdv_failures, for several reasons. First, as you said above, it makes the param explicitly service-side. Second, because this needs to be merged forward to v3 in 032, the value is used by both versions and moves to hs_common.h, so the name should clearly namespace it to the "hs" subsystem and identify it as service-side. Third, "rend" in the code is associated with v2, hence "rdv" instead. Finally, I think a name like this will be more explicit when a dirauth reads it from the consensus or a torrc file.
I simply made the default, min, and max values #defines which, when merging forward, will move to hs_common.c.
This patch won't merge forward cleanly because of the big v3 change that started in 031, so here are the respective branches per version.
NOTE: the fixup commit from above is in all the branches, so --autosquash might be desirable before merging. I hope I got the merge forward right... let me know if anything goes wrong. There will be one very easy conflict from 029 to 030.
Branches: ticket24895_029_01, ticket24895_031_01, and ticket24895_032_01
Spec branch: ticket24895_01