We should not give up on the network, our TLS connection, or our guard in the event of path selection failures (which can happen if we're low on microdescriptors, and/or if the user has set a bunch of path-restricting torrc options).
I think this might want to be a backport. It should also handle #25347 (moved).
Ok, I made the smallest possible change here. I just moved the path length check up from the guard failure block to the beginning of the function. The only thing that wasn't build failure accounting or node/network blaming was the HS RP retry code, which I also moved up.
I also removed the destroy check, as per #25347 (moved), as a separate commit. We can decide which one we like better. On the one hand, asn's branch has more checks and might be safer. On the other hand, this is a smaller change, and keeps everything related to counting circuit failures in the same place.
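To make the reordering concrete, here's a simplified, hypothetical sketch. The struct and names below are stand-ins for the real `circuit_build_failed()` internals; only the ordering is the point: the path-length check runs before any failure accounting or node/network blaming.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the circuit state consulted by
 * circuit_build_failed(); field names are illustrative only. */
typedef struct {
  int desired_path_len;  /* hops the build state asked for */
  int built_path_len;    /* hops actually negotiated */
} circ_sketch_t;

/* Return true iff this failure should be counted against the guard or
 * the network. The path-length check runs FIRST: a circuit that failed
 * during path selection says nothing about the guard's health, so we
 * bail before any blaming or build-failure accounting. */
bool failure_counts_against_guard(const circ_sketch_t *circ) {
  if (circ->built_path_len < circ->desired_path_len)
    return false;  /* path selection failed; don't blame anyone */
  return true;
}
```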
mikeperry/bug25705
Thanks for the patch here Mike! A GitHub PR would be helpful so that I can comment inline, but I'll just do it here for now:
4c64811d0b: Looks reasonable to me. I find it kinda naughty that we special-case S_CONNECT_REND twice in that function, but I didn't find a straightforward way to improve this.
a2d44546c2: A few things here:
Shouldn't we do the already_marked computation even when circ->base_.received_destroy, so that we don't double-mark? IIUC, the only reason we now distinguish between receiving a DESTROY or not is so that we can do better logging. I think that also has the potential to simplify the code a bit. Ideally, that whole block should go into its own function.
Do we want to mark the guard as bad for any DESTROY reason? Is there a reason we wouldn't do that? Do we remember our old rationale here? It seems the commit that introduced the don't-do-anything-on-DESTROY logic is Roger's 5de88dda0acda6914bacd1e14c2e037798c5d9d9.
Do we want to merge this patch without taking the precautions of Roger's points G and E from #25347 (moved)? I have strong hesitation about the behavior of this commit given Roger's point G. Also see comment:31:ticket:25347.
The code needed a parenthesis to compile. We should test this new code, it seems ultra crucial!
I don't feel super strongly about the second commit. I threw it in as an afterthought so we could compare it against #25347 (moved). I guess I forgot to even check that it compiled :/
I think we can drop it, subject to a plan to move to two guards. I do not think it is at all safe to stay with one guard and allow guards to be DoS'd to the point where client activity ceases. That is a clear confirmation signal that can be used to build a guard discovery attack.
Posting a discussion with armadev from yesterday which might be relevant to this ticket. I don't have enough time to dig into this right now, so I'll just post the raw logs:
16:01 <+armadev> yeah, i am expecting that when i read 25705, i will want to remind people that right now we *do* want to call some of that code for circuits where we failed to pick the first hop,
16:01 <+armadev> because that is how we rate limit the amount of thrashing that tor does when it goes offline and stays there for a while
16:02 <+asn> hmm
16:02 <+asn> what you mean?
16:02 <+asn> i have not considered that
16:02 <+armadev> that is, if you're offline and you've marked all your tors down, you wake up every so often, and try some circuits, and they all fail because you don't think anything is up, and you sleep again for a bit
16:02 <+armadev> it is possible that prop#271 changed that behavior
16:02 -zwiebelbot:#tor-dev- Prop#271: Another algorithm for guard selection [CLOSED]
16:03 <+asn> we sleep for a bit?
16:03 <+asn> do we?
16:03 <+asn> i dont think we do? either with prop271 or before
16:03 <+asn> do we?
16:04 <+armadev> check out did_circs_fail_last_period
16:04 <+armadev> if we have MAX_CIRCUIT_FAILURES failures in this period,
16:04 <+armadev> and also we had MAX_CIRCUIT_FAILURES in the last period
16:04 <+armadev> then we abort all circuits
16:05 <+armadev> that's part of why it is important to not have too many failures that aren't really failures
16:06 <+armadev> because if you have too many, you'll trigger the "woah stop for a bit" feature
16:06 <+asn> hmmm
16:06 <+armadev> which is good if they're really failures
16:06 <+asn> i was not aware of this
16:06 <+asn> feature
16:07 <+armadev> now you are! :)
16:07 <+armadev> it is mostly for tors that are actually offline right now
16:07 <+armadev> so they thrash less
16:07 <+armadev> and in that respect it's probably related to the "use less cpu when not in active use" tickets
How about a log_fn_ratelim(LOG_NOTICE/LOG_WARN) here instead of the info log, then? This is not something that should happen unless the user is doing something crazy in torrc, or there is another bug (such as not having enough mds and never getting more). In each case they/we want to know that it happened, and I also think that not giving up in these cases is a good move, even if it costs a bit of spinning.
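For reference, the idea behind a rate-limited log call boils down to something like this standalone sketch (the struct shape here is invented for illustration, not Tor's actual ratelim_t):

```c
#include <stdbool.h>
#include <time.h>

/* Minimal standalone rate limiter: allow at most one emission per
 * `interval` seconds. Invented shape; Tor's ratelim_t differs. */
typedef struct {
  time_t interval;
  time_t last_allowed;
} ratelim_sketch_t;

/* Return true if a message may be emitted now; a caller would log at
 * NOTICE/WARN when this returns true and stay silent otherwise, so a
 * misconfigured torrc or md shortage can't flood the log. */
bool ratelim_allows(ratelim_sketch_t *rl, time_t now) {
  if (now - rl->last_allowed < rl->interval)
    return false;
  rl->last_allowed = now;
  return true;
}
```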
I don't think arma's concern above was correct. I just tested Mike's branch (and master) with an offline network and confirmed that the MAX_CIRCUIT_FAILURES spinning protection still kicks in. That's because this ticket only skips failure reporting for path selection failures. With an offline network, the failures are circuit-build failures, not path-selection failures, so the failed circuits are still counted correctly.
I took a second look at Mike's patch and it looks good to me. Marking this as merge_ready.
Roger, are you persuaded? Asn's testing seems convincing to me, and the code looks okay, to the extent that I understand the subtleties.
Mike and Asn, how far do you think we might want to backport it? It's generally most convenient to have a branch that we plan to backport based on the earliest point to which we might want to merge it.
Ok I added the ratelimit log message and put this on maint-0.3.3 under mikeperry/bug25705_v3_033.
I think an 0.3.3 backport makes sense, because it would be nice to have this type of checking in place for the HSLayerXNodes options. I am less sure it needs a further backport since we nacked the #25347 (moved)-related change.
In addition to what asn said, the other thing that prevents Roger's second concern is that this patch bails before incrementing n_circuit_failures, so these failures won't trigger the "woah, go to sleep" behavior. I believe that is exactly what we want for these types of failures: they should not cause us to blame the guard or give up on the network, since neither is at fault.