When Tor can't download microdescriptors (#21969 (moved)), maybe it should try authorities or fallbacks (#23863 (moved)), before it runs out of microdesc retries (#24113 (moved)).
But even after Tor has the microdescs it needs, it sometimes doesn't start building circuits again. Instead, it is in state "waiting for circuit".
Looks like it's more serious.
So the jammed Tor instance that generated the logs in #21969 (moved) never healed even after 7 hours of waiting. Was running TBB 7.5 so it was Tor 0.3.2.9.
Switched to info log level, closed TBB and launched it again. Tor Launcher never fully loaded, and was just stuck in the loading state with the text "Establishing a Tor circuit". Captured and attached a screenshot of this. I gathered info logs for a few minutes in this state, then terminated by clicking cancel on Tor Launcher.
Then I switched to debug log level and launched again. Tor Launcher remained stuck at the same loading stage as before. I left it in this state and gathered debug logs for about 20 minutes, then clicked cancel on Tor Launcher to close it.
Tried about 5 more times, still the same result. The only thing that fixed it was the deletion of the state file. After I deleted the state file, Tor launcher fully loaded in about 3 seconds and browsing went on just fine.
I wonder if (both?) your guards stopped responding to your extend requests, due to a DDoS defence? (Or due to overload?)
Do you still have your old guard state?
What are the versions of your guards?
Usually when guards stop responding Tor logs a message that guard X is failing more circuits than usual.
I do have the guard versions, yes; all 3 are online with no downtime over the last few days.
Guard 1/3: Tor 0.2.5.16
Guard 2/3: Tor 0.2.9.14
Guard 3/3: Tor 0.2.9.14
I guess DoS defense is unrelated.
The probability that all of them were overloaded at the same time, with no circuit-failure reports in my client-side Tor logs, and with so many hours of downtime without recovery, seems pretty low. But even if that were the case, shouldn't Tor move on to other guards that work and retry the primary guards later?
Thanks for the logs. Looking at info_25347.zip, I see that after bootstrapping to 90% we failed about 80 circuits with:
circuit_mark_for_close_(): Circuit 0 (id: 8) marked for close at command.c:589 (orig reason: 517, new reason: 0)
IIUC, this means that the client sent a CREATE cell but received a DESTROY cell (from the guard?) with reason END_CIRC_REASON_RESOURCELIMIT. RESOURCELIMIT is the reason returned by the DoS subsystem, and also by other parts of the Tor codebase when they are overworked (e.g. during OOM, or when onionskins cannot be handled in time). This could mean that perhaps your guard did not run the DoS code, and hence was overworked by the influx of clients.
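For reference, the "orig reason: 517" value is consistent with that reading: assuming the END_CIRC_REASON_* values from or.h (RESOURCELIMIT is 5, and 512 is the flag meaning the reason was supplied by a remote relay), 517 decodes to a remotely-sent RESOURCELIMIT. A tiny standalone illustration, not Tor's actual code path:
{{{
/* Standalone illustration (not Tor's code path): decode the
 * "orig reason: 517" value from the log line above. */
#include <stdio.h>

#define END_CIRC_REASON_RESOURCELIMIT 5   /* from or.h */
#define END_CIRC_REASON_FLAG_REMOTE   512 /* reason came from a remote relay */

int main(void)
{
  int orig_reason = 517;  /* value seen in the log */
  int remote = !!(orig_reason & END_CIRC_REASON_FLAG_REMOTE);
  int reason = orig_reason & ~END_CIRC_REASON_FLAG_REMOTE;
  printf("remote=%d reason=%d (RESOURCELIMIT=%d)\n",
         remote, reason, END_CIRC_REASON_RESOURCELIMIT);
  /* prints: remote=1 reason=5 (RESOURCELIMIT=5) */
  return 0;
}
}}}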
All 3 guards in the primary set were running Tor versions that do not have the DoS patch, so this can be eliminated.
Regarding primary guard 1/3 being overworked, I can't prove otherwise, but:
usually the client log warns about this with "your guard X is failing a larger amount of circuits than usual. success counts are x/y - this could indicate an attack but also mean that the Tor network is overloaded ..." - there were no such messages in the client log;
why does it not heal by itself even in ~7 hours? Shouldn't it either move to another guard or do something, rather than keep running in a useless state?
Some more interesting stuff:
checked all 3 guards in the primary set and they were behaving normally, running in consensus with good bandwidth values;
after starting with a fresh new state file (like starting from scratch), Tor selected new primary guards and connected just fine. So I turned it off, manually edited the state file and put back at confirmed_idx=0 the same guard that was hanging before (with the old state file). Started it again, and it worked like nothing had happened. Could it be a coincidence that for 7 hours that particular guard was overworked and failing circuits, and then 5 minutes after I edited the new state file it came back to normal operation? Possible - there's a non-zero chance it's true - but I wouldn't bet on it.
teor said on IRC that we might be doing something with the state file or the mds of the primary guards. I tend to think in the same direction. Right now I am trying to reproduce it so I can provide two state files, the good working one and the not-working one, with the same primary guards; maybe they will help us determine what is going on.
I think this is the same bug reported by cypherpunks on the parent ticket and by alec, so it might be the last annoying guard bug.
Thanks for the info s7r. I took another look at your logs (after my comment:5) and at your latest comment.
Looking at your logs, it seems like your guard rejected about 230 new circuit creations in 15 minutes with the excuse of RESOURCELIMIT. And your client just kept making more and more circuits to the same guard that were getting rejected... I've also noticed this exact same behavior on a client of mine recently.
My theory on why RESOURCELIMIT was used by your guard (given that you say that DoS patch was disabled) is that assign_onionskin_to_cpuworker() failed because onion_pending_add() failed because have_room_for_onionskin() failed. That means that the relay was overworked and had way too many cells to process at that time. Unfortunately, I can't see whether you are sending NTOR or TAP cells given your logs.
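To make that theory concrete, here is a simplified sketch of the kind of check have_room_for_onionskin() makes; the real function in onion.c also accounts for memory limits and the NTOR/TAP split, so treat the names and arithmetic below as illustrative only:
{{{
/* Simplified sketch of the "is the onionskin queue too full?" decision.
 * Not the real have_room_for_onionskin(): it ignores the memory cap and
 * the per-handshake-type queues that the real code tracks. */
#include <stdbool.h>
#include <stdint.h>

#define MAX_ONION_QUEUE_DELAY_MS 1750 /* MaxOnionQueueDelay default */

static bool
have_room_for_onionskin_sketch(uint64_t n_queued_onionskins,
                               uint64_t usec_to_process_one)
{
  /* Estimate how long the already-queued onionskins will take. */
  uint64_t est_delay_ms = (n_queued_onionskins * usec_to_process_one) / 1000;
  /* If we can't get to a new onionskin within MaxOnionQueueDelay, say no;
   * the relay then answers the CREATE with a RESOURCELIMIT DESTROY. */
  return est_delay_ms < MAX_ONION_QUEUE_DELAY_MS;
}
}}}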
Like you said, I think the most obvious misbehavior here is that you keep on hassling your guard even tho it's telling you to relax by sending you RESOURCELIMIT DESTROY cells. Perhaps one approach here would be to choose a different guard after a guard has sent us RESOURCELIMIT cells, in an attempt to unclog the guard and to get better service. Let's think about this some more:
What's the best behavior here? Should we mark the guard as down after receiving a single RESOURCELIMIT cell, or should we hassle the guard a bit before giving up?
Most importantly, can we make sure that the DESTROY cell came from the guard and not from some other node in the path? If we can make sure that the DESTROY cell came from the guard, this seems to me like a pretty safe countermeasure since we should trust the guard to tell us whether it's overworked or not.
WRT timeline here, I think working on this countermeasure (mark guard as down when overworked to get better service) seems like a plausible goal for 033, but anything more involved will probably need to wait for 034.
Would appreciate feedback from Nick or Tim here :)
I still can't explain why you managed to bootstrap after hacking your state file tho. Perhaps a coincidence? Perhaps you were overworking your guard and when you stopped, it relaxed? Perhaps the hack worked differently than you imagine? Not sure.
Like you said, I think the most obvious misbehavior here is that you keep on hassling your guard even tho it's telling you to relax by sending you RESOURCELIMIT DESTROY cells. Perhaps one approach here would be to choose a different guard after a guard has sent us RESOURCELIMIT cells, in an attempt to unclog the guard and to get better service. Let's think about this some more:
What's the best behavior here? Should we mark the guard as down after receiving a single RESOURCELIMIT cell, or should we hassle the guard a bit before giving up?
Some reasons to go with the former approach here are:
a) The way primary guards work, even if we ditch the guard for a single RESOURCELIMIT cell, we are gonna retry it again after a few minutes. Also, after a few minutes there are higher chances that the guard is gonna be unclogged since we stopped hassling it.
b) It will be easier to implement.
Most importantly, can we make sure that the DESTROY cell came from the guard and not from some other node in the path? If we can make sure that the DESTROY cell came from the guard, this seems to me like a pretty safe countermeasure since we should trust the guard to tell us whether it's overworked or not.
Nick says that if we get a DESTROY cell before we receive a CREATED cell (or maybe before we send an EXTEND cell) we can be sure it comes from the guard since there is no one else on the path. I wonder if this is easy to figure out in code.
Nick also suggested we get feedback from Mike first since he will have good ideas on whether this can be bad for anonymity.
Nick says that if we get a DESTROY cell before we receive a CREATED cell (or maybe before we send an EXTEND cell) we can be sure it comes from the guard since there is no one else on the path. I wonder if this is easy to figure out in code.
Code-wise this can be simple. You can add a flag to circuit_t that says "got a CREATED"; if it's set, then for any DESTROY cell you get you won't be able to tell whether it came from the guard or not. I think the other way you can do it is to walk the cpath and see if the first hop is in OPEN state. When you get the CREATED cell, see circuit_finish_handshake(), which will set the hop state to OPEN.
After that, you get the CREATED cell embedded in an EXTENDED cell.
You don't even need the flag -- you can check how many open hops the circuit has. If it has >= 1, then it has extended at least one hop, and the DESTROY cell might not be from the guard. If it has 0 open hops, then it hasn't extended to the guard yet.
You don't even need the flag -- you can check how many open hops the circuit has. If it has >= 1, then it has extended at least one hop, and the DESTROY cell might not be from the guard. If it has 0 open hops, then it hasn't extended to the guard yet.
To clarify, the function (or check) you want to use is circuit_get_cpath_opened_len(), not circuit_get_cpath_len(). (This is also what dgoulet said in terms of the check, but I wanted to make sure the right function is used).
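A minimal sketch of that check, assuming Tor's internal types and circuit_get_cpath_opened_len(); the destroy_came_from_guard() name is made up for illustration:
{{{
/* Sketch only; assumes Tor's internal headers.  A DESTROY we receive can
 * only have come from the guard if no hop of the circuit has completed
 * its handshake yet, which is exactly what a zero value from
 * circuit_get_cpath_opened_len() means. */
static int
destroy_came_from_guard(const origin_circuit_t *circ)
{
  return circuit_get_cpath_opened_len(circ) == 0;
}
}}}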
Wrt why there were no path bias messages -- the path bias code only counts circuits that made it to the third hop. This is because it is only checking for circuits that fail during tagging, which would allow them to survive to the 2nd hop but fail at the 3rd (due to our current lack of any MAC mechanism at hop 2). So the lack of those messages is also consistent with the guard itself failing the circuits with RESOURCELIMIT.
Given that, I think it is reasonable to give up on the guard temporarily and treat it as if the TLS connection failed: switch to the second guard for now, with the hopes of the prop271 code going back to it eventually.
Yet another case where the adversary or just bad luck can force you to use a second guard, but I don't see a way around that. Sitting around waiting and doing nothing is arguably even worse, and leaks even more info (because for eg, the targeted HS would go down).
I pushed a branch at bug25347 in my github repo (https://github.com/asn-d6/tor). It basically implements the approach we discussed above.
A side-question here is whether there are any other DESTROY cell reasons (apart from RESOURCELIMIT) that we should mark our guard as unreachable for. For now, I'm restricting this to RESOURCELIMIT to avoid bikeshedding, but stuff like END_CIRC_REASON_HIBERNATING might also be relevant.
Unfortunately it's untested. Does anyone have any good hints on how to test this? I will be trying to do this today. Will let you know later.
I did the following chutney testing on the branch above:
a) I hacked a Tor relay to send a RESOURCELIMIT DESTROY cell upon receiving a CREATE cell and made sure that the client would switch guards upon seeing the DESTROY. Worked.
b) I hacked a Tor relay to send a RESOURCELIMIT DESTROY cell upon receiving a RELAY cell (which AFAIK should be after the guard hop has been established) and made sure that the client would tear down the circuit but would not mark the guard as unreachable. Worked.
What other kind of testing would you like to see here?
(A) If I were designing that patch, I would put the "if circuit_get_cpath_opened_len is 0" clause up in command_process_destroy_cell(), and then rename the function to the more certain entry_guard_sent_destroy_cell(), or maybe something like entry_guard_responded_with_destroy_cell(). That way the function is clear about what it's doing, and it's clear how you're deciding whether to call it, and those are kept separate.
(B) Note that in the ddos mitigation stuff, we refuse circuits from overloaders with reason resource-limit (see DOS_CC_DEFENSE_REFUSE_CELL). And at least at the time, the goal was to soak up their overload circuits at the guard level, i.e. keep them occupied at a point where we're able to limit the damage they do. If we implement this ticket, overloader clients will immediately skip off to hassle some other guard. Maybe that's ok? Or maybe it would be a sad side effect? Maybe we should change the ddos mitigation to use some other destroy reason, so we don't trigger overloaders to move? I'm not sure what's best, but let's intentionally choose something.
(C) I think you're right that END_CIRC_REASON_HIBERNATING is a good reason to include if we're doing this ticket.
(D) You're fine with circuit_get_cpath_opened_len here, but for your future knowledge: I believe clients only get a destroy cell from the guard when the guard has failed to establish the first hop. If the circuit fails after the guard, clients get a relay truncated cell back. That is, destroys currently don't chain back to the client -- if the guard gets a destroy from the middle relay, it keeps the circuit open at itself, and converts the destroy into a truncated cell to send back. In theory this design is to allow clients to say "well ok, how about extending to this other relay then", but in practice clients just say "well screw that, here's a destroy" and start over.
(E) For the advanced challenge mode of the game: I wonder if it would be useful to distinguish between preemptive circuits that we tried to make, where it's actually fine if we don't have them quite yet, vs reactive circuits that we tried to make because we have some user request that we're trying to handle right now. In the latter case we would want to try another guard because we need things to work. But in the former case I could imagine waiting for a second or two and not falling back to a different guard immediately.
(F) Are you sure that the proposed branch is related to the topic of this ticket? The ticket title and body seem quite different from what the branch does. That makes me wonder if maybe we're not actually solving the original bug report. :)
(G) Do we have a handle on how often existing guards respond with reason resource-limit? Imagine a Tor client that makes a bunch of circuits often, say because it's an active onion service. And let's say on average 5% of create cells get a resource-limit response. And let's say we wait 30 minutes before retrying a guard that we marked as down. We're going to churn through a lot of guards this way, right?
All of that stuff looks eerily similar, right? If we proceed with this ticket, we should work to unify these two "our first hop failed" cases, so we don't try to maintain two parallel versions of what to do.
(F) Are you sure that the proposed branch is related to the topic of this ticket? The ticket title and body seem quite different from what the branch does. That makes me wonder if maybe we're not actually solving the original bug report. :)
This patch is meant to address the second paragraph of the ticket body, and also is meant to directly fix the issue that s7r (and others) encountered in #21969 (moved). My understanding is that this is the main guard issue right now, and not the fact that clients can run out of md retries. Do you think this is true, teor?
(B) Note that in the ddos mitigation stuff, we refuse circuits from overloaders with reason resource-limit (see DOS_CC_DEFENSE_REFUSE_CELL). And at least at the time, the goal was to soak up their overload circuits at the guard level, i.e. keep them occupied at a point where we're able to limit the damage they do. If we implement this ticket, overloader clients will immediately skip off to hassle some other guard. Maybe that's ok? Or maybe it would be a sad side effect? Maybe we should change the ddos mitigation to use some other destroy reason, so we don't trigger overloaders to move? I'm not sure what's best, but let's intentionally choose something.
I think this is an important design-level issue here. I was actually not aware that the intention of the DoS subsystem is to soak up the overloader on a single guard, and if that's the case this patch works completely against it and we should NACK it. However, if that's the case, I'm not sure what we should do about the users who encounter this reachability issue, where they are behind an overloaded guard and they can't use it or escape it. Seems like a tradeoff between the overloader completely disabling a single guard and its users, or the overloader spreading the load across the network and potentially harming multiple relays before they kick the overloader to the next one. And of course, all that's assuming an accidental overloader who follows the Tor protocol, because if they hack their tor client to be intentionally evil they can do whatever they want.
Not sure how to proceed here. Any ideas?
(G) Do we have a handle on how often existing guards respond with reason resource-limit? Imagine a Tor client that makes a bunch of circuits often, say because it's an active onion service. And let's say on average 5% of create cells get a resource-limit response. And let's say we wait 30 minutes before retrying a guard that we marked as down. We're going to churn through a lot of guards this way, right?
That's another design-level issue we should think about. We should make sure this does not allow an attacker to simply cause an HS to cycle guards forever by overloading its current guards. This might be an argument for not failing the guard on a single RESOURCELIMIT but instead waiting through some period of unreachability before failing the guard... This might also be an argument for switching to 2 guards, so that the adversary has a harder time overloading all guards. Not sure if we can properly do the research and implementation for this in the 033 timeframe tho...
(F) Are you sure that the proposed branch is related to the topic of this ticket? The ticket title and body seem quite different from what the branch does. That makes me wonder if maybe we're not actually solving the original bug report. :)
This patch is meant to address the second paragraph of the ticket body, and also is meant to directly fix the issue that s7r (and others) encountered in #21969 (moved). My understanding is that this is the main guard issue right now, and not the fact that clients can run out of md retries. Do you think this is true, teor?
We removed all retry limits when we removed the legacy static download schedule code. So clients can't run out of md retries any more. (I can't find the ticket right now, but we did merge the patch.)
So yes, failing to reconnect is a major guard issue. I don't know if there are any others.
Looking at your logs, it seems like your guard rejected about 230 new circuit creations in 15 minutes with the excuse of RESOURCELIMIT. And your client just kept making more and more circuits to the same guard that were getting rejected... I've also noticed this exact same behavior on a client of mine recently.
I see that as well, but this happens fairly often and Tor normally has no problem switching to guard 2/3 or even guard 3/3 to maintain functionality. This time (which happens rarely) it just remained stuck in this useless state.
My theory on why RESOURCELIMIT was used by your guard (given that you say that DoS patch was disabled) is that assign_onionskin_to_cpuworker() failed because onion_pending_add() failed because have_room_for_onionskin() failed. That means that the relay was overworked and had way too many cells to process at that time. Unfortunately, I can't see whether you are sending NTOR or TAP cells given your logs.
I know for sure the DoS patch is not related because I triple checked all 3 primary guards and not even one of them was running a Tor version that includes the DoS patch we merged. I think I was using only NTOR cells, because I was only trying to reach check.tpo and duckduckgo clearnet websites.
Like you said, I think the most obvious misbehavior here is that you keep on hassling your guard even tho it's telling you to relax by sending you RESOURCELIMIT DESTROY cells. Perhaps one approach here would be to choose a different guard after a guard has sent us RESOURCELIMIT cells, in an attempt to unclog the guard and to get better service. Let's think about this some more:
What's the best behavior here? Should we mark the guard as down after receiving a single RESOURCELIMIT cell, or should we hassle the guard a bit before giving up?
This is the most important part we need to take care of. I dislike the idea of removing the guard after receiving a single RESOURCELIMIT cell. At the very least we should retry it after some time using exponential backoff, exactly as we do when one of our primary guards is not running or not listed, and keep the same logic, timing and behavior so we don't have to maintain more branches.
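As a purely hypothetical illustration of that backoff idea (the constants and the next_retry_delay() helper are made up, not anything Tor currently does):
{{{
/* Hypothetical illustration of the backoff idea above: double the retry
 * interval for each consecutive RESOURCELIMIT refusal, up to a cap.
 * The constants and the function are made up, not existing Tor code. */
#include <stdint.h>

#define INITIAL_RETRY_SECS 60        /* first retry after a minute */
#define MAX_RETRY_SECS     (30 * 60) /* never wait longer than 30 minutes */

static uint32_t
next_retry_delay(unsigned consecutive_resourcelimit_refusals)
{
  uint64_t delay = INITIAL_RETRY_SECS;
  unsigned i;
  for (i = 0; i < consecutive_resourcelimit_refusals; i++) {
    delay *= 2;
    if (delay >= MAX_RETRY_SECS)
      return MAX_RETRY_SECS;
  }
  return (uint32_t)delay;
}
}}}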
Most importantly, can we make sure that the DESTROY cell came from the guard and not from some other node in the path? If we can make sure that the DESTROY cell came from the guard, this seems to me like a pretty safe countermeasure since we should trust the guard to tell us whether it's overworked or not.
As far as I understand from arma's comment, the DESTROY cell can only come from the guard.
WRT timeline here, I think working on this countermeasure (mark guard as down when overworked to get better service) seems like a plausible goal for 033, but anything more involved will probably need to wait for 034.
Would appreciate feedback from Nick or Tim here :)
I still can't explain why you managed to bootstrap after hacking your state file tho. Perhaps a coincidence? Perhaps you were overworking your guard and when you stopped, it relaxed? Perhaps the hack worked differently than you imagine? Not sure.
I sincerely hope so. But it makes me think: for many hours the guard is overworked, yet when I delete my state file, restart, and then edit the new state file to put back all 3 of the previous primary guards that were not letting me connect, it just connects fine. I don't have any evidence that there was something wrong with the state file, and I don't see what could be wrong with it; it does not make any sense. It is very hard to reproduce / catch this bug in the wild.
It is very hard to reproduce / catch this bug in the wild.
Hey @s7r, can you test the scenario mentioned in ticket:25600#comment:6? Maybe it's related to this bug.
We already treat non-destroy failures at the guard as rotate-away failures in circuit_build_failed(). Note that there is an explicit check to not blame the guard if circ->base_.received_destroy is set. If we remove that check in circuit_build_failed(), we should not need asn's patch, as the behavior would then be equivalent (or actually a superset, since it would cover all destroy reasons). (I think this is what arma meant in his point H?). This would be a simpler patch.
Roger's point G is scary. I like the solution in E, though, and I think it actually fixes G. I would implement E by setting an is_predicted flag in circuit_predict_and_launch_new(), and then check that flag in select_entry_guard_for_circuit() to completely ignore the is_reachable status there if it is set. Then if any of those predicted circuits succeed, entry_guards_node_success() will get called, which will clear the is_unreachable flag and allow us to resume using the guard. For guards that fail only a non-zero but small portion of their circuits, this should allow us to keep using them rather than rotating away. I am a fan of this plan, and could write a patch for it if we like it.
Additional wrinkles to the second point: we have to ignore the ordering of our primary guards for this fix to allow predicted circuits to complete in the case of a truly downed guard. However, if all primary guards are unreachable, we may want a different strategy to avoid spamming them with predicted circuits. Maybe first check that we have at least 1 is_reachable primary guard before picking one randomly for predicted circuits? How about that?
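For what it's worth, here is a self-contained sketch of that last idea, with made-up minimal types and a hypothetical pick_primary_for_predicted_circuit() helper (nothing here is Tor's actual selection code): ignore the is_reachable status for a predicted circuit only while at least one primary guard still looks reachable.
{{{
/* Self-contained sketch of the idea above, with made-up minimal types:
 * for a *predicted* circuit, ignore the is_reachable flag so the circuit
 * can re-probe an "unreachable" primary guard -- but only while at least
 * one primary still looks reachable, so a fully-down primary set is not
 * spammed with predicted circuits. */
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct guard_t {
  bool is_reachable; /* cleared when we decide this guard is down */
} guard_t;

static guard_t *
pick_primary_for_predicted_circuit(guard_t **primaries, size_t n)
{
  size_t i, n_reachable = 0;
  for (i = 0; i < n; i++) {
    if (primaries[i]->is_reachable)
      n_reachable++;
  }
  if (n_reachable > 0) {
    /* At least one primary looks usable: a predicted circuit may pick
     * any primary, even an "unreachable" one, to re-probe it. */
    return primaries[rand() % n];
  }
  /* All primaries look down: fall back to the normal selection logic
   * (not shown) instead of hammering them. */
  return NULL;
}
}}}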
(F) Are you sure that the proposed branch is related to the topic of this ticket? The ticket title and body seem quite different from what the branch does. That makes me wonder if maybe we're not actually solving the original bug report. :)
This patch is meant to address the second paragraph of the ticket body, and also is meant to directly fix the issue that s7r (and others) encountered in #21969 (moved). My understanding is that this is the main guard issue right now, and not the fact that clients can run out of md retries. Do you think this is true, teor?
We removed all retry limits when we removed the legacy static download schedule code. So clients can't run out of md retries any more. (I can't find the ticket right now, but we did merge the patch.)
We already treat non-destroy failures at the guard as rotate-away failures in circuit_build_failed(). Note that there is an explicit check to not blame the guard if circ->base_.received_destroy is set. If we remove that check in circuit_build_failed(), we should not need asn's patch, as the behavior would then be equivalent (or actually a superset, since it would cover all destroy reasons). (I think this is what arma meant in his point H?). This would be a simpler patch.
Seems like a reasonable plan. Let's proceed like that.
Roger's point G is scary. I like the solution in E, though, and I think it actually fixes G. I would implement E by setting an is_predicted flag in circuit_predict_and_launch_new(), and then check that flag in select_entry_guard_for_circuit() to completely ignore the is_reachable status there if it is set. Then if any of those predicted circuits succeed, entry_guards_node_success() will get called, which will clear the is_unreachable flag and allow us to resume using the guard. For guards that fail only a non-zero but small portion of their circuits, this should allow us to keep using them rather than rotating away. I am a fan of this plan, and could write a patch for it if we like it.
Hmm, I can see how the solution of E can help with G, by making Alice try her primary guards more aggressively in cases of predicted circuits, but if Alice is a busy hidden service (like Roger describes in E) most of her circs are not gonna be predicted.
Furthermore, I wonder if it can completely address the problem in actually mischievous scenarios, where the adversary overloads all 3 primary guards of Alice (targeting them using guard discovery attacks) to make her switch to other guards in her sampled set.
We already treat non-destroy failures at the guard as rotate-away failures in circuit_build_failed(). Note that there is an explicit check to not blame the guard if circ->base_.received_destroy is set. If we remove that check in circuit_build_failed(), we should not need asn's patch, as the behavior would then be equivalent (or actually a superset, since it would cover all destroy reasons). (I think this is what arma meant in his point H?). This would be a simpler patch.
Seems like a reasonable plan. Let's proceed like that.
Ok. Note that after talking to arma, I decided to file #25705 (moved). In that refactor I could also fix this issue. I guess it depends on whether we want to backport that.
Roger's point G is scary. I like the solution in E, though, and I think it actually fixes G. I would implement E by setting an is_predicted flag in circuit_predict_and_launch_new(), and then check that flag in select_entry_guard_for_circuit() to completely ignore the is_reachable status there if it is set. Then if any of those predicted circuits succeed, entry_guards_node_success() will get called, which will clear the is_unreachable flag and allow us to resume using the guard. For guards that fail only a non-zero but small portion of their circuits, this should allow us to keep using them rather than rotating away. I am a fan of this plan, and could write a patch for it if we like it.
Hmm, I can see how the solution of E can help with G, by making Alice try her primary guards more aggressively in cases of predicted circuits, but if Alice is a busy hidden service (like Roger describes in E) most of her circs are not gonna be predicted.
But a client/service will always keep building predicted circuits such that it has 3 of each type available. This means even a busy hidden service will keep making predicted circuits, which, any time they succeed, will reset the fact that the guard is reachable for use by other circuits. So, so long as some circuits are succeeding through that guard, the service should keep going back to it fairly quickly.
Furthermore, I wonder if it can completely address the problem in actually mischievous scenarios, where the adversary overloads all 3 primary guards of Alice (targeting them using guard discovery attacks) to make her switch to other guards in her sampled set.
This case is tricky. I am not sure what to do with predicted circuits if all primary guards get into the unreachable state. One option is to toss a coin and pick primary vs sampled for predicted circuits, if all primary are down.
Trac: Summary changed from "Tor stops building circuits, and doesn't start when it has enough directory information" to "Tor keeps on trying the same overloaded guard over and over"
Roger's point G is scary. I like the solution in E, though, and I think it actually fixes G. I would implement E by setting an is_predicted flag in circuit_predict_and_launch_new(), and then check that flag in select_entry_guard_for_circuit() to completely ignore the is_reachable status there if it is set. Then if any of those predicted circuits succeed, entry_guards_node_success() will get called, which will clear the is_unreachable flag and allow us to resume using the guard. For guards that fail only a non-zero but small portion of their circuits, this should allow us to keep using them rather than rotating away. I am a fan of this plan, and could write a patch for it if we like it.
Hmm, I can see how the solution of E can help with G, by making Alice try her primary guards more aggressively in cases of predicted circuits, but if Alice is a busy hidden service (like Roger describes in E) most of her circs are not gonna be predicted.
Furthermore, I wonder if it can completely address the problem in actually mischievous scenarios, where the adversary overloads all 3 primary guards of Alice (targeting them using guard discovery attacks) to make her switch to other guards in her sampled set.
FWIW, I discussed the above mischievous case with Nickm, and we considered whether we should handle this "attacker can make you switch guards" congestion-attack case specially. Nick pointed out that this is the reason the max sample size mechanism of prop271 was introduced (so that it's impossible for an adversary to make you try all the guards). If we are unhappy with how the sampled set of prop271 works, we should refactor that instead of engineering special solutions for this ticket.
tl;dr: I think we should be OK with switching guards in the case of RESOURCELIMIT, and rely on prop271 to defend against malicious attacks of this type. If we don't like the prop271 behavior, we should improve prop271.
EDIT: However, I'm still worried that it might be too easy for an adversary to cause a client to rotate guards by overloading their guard. Remember that this is not even about the DoS subsystem; it's about onionskin overload (see comment:10), and have_room_for_onionskin() seems like an easy function to overwhelm as an attacker. Hence, I'm reluctant to give the attacker the opportunity to explore the whole guard sampled set of the client (about 60 guards). Also, note that since this is about onionskins, it's an old bug not caused by the DoS subsystem, so this behavior might not be new.
Anyway, right now my intuition is to NACK the behavior of "switch guard on the first DESTROY you receive" as a client because of the security concerns. It might make sense to switch guard if we have received like 100 DESTROYS in a row, but this probably does not fit for 033 timescale. Mike, do you think differently?
It is not happening often enough, or making the user experience bad enough, to force us into a decision that might degrade security / anonymity. If it were, we would have hundreds of reports by now. I am not sure how often and how badly it affects popular onion services that run in anonymous mode, but it looks like it can wait a bit longer.
Switching guards on the first DESTROY cell received as a client sounds terrible to me; I say we should NACK it. A more proper behavior would be for clients to only relax a little after receiving, say, 10 DESTROY cells triggered by RESOURCELIMIT in a row - not switch away from the overloaded guard entirely just yet - and then increase the wait period between circuit retries, so that we preserve Tor's guard rotation interval as much as possible.
If this behavior is infrequent, then it is probably a good idea not to rotate guards unless we get a lot of destroys.
I don't like the fact that by not doing anything about this, we're allowing a confirmation/search attack where an adversary can DoS guards until a hidden service becomes (mostly) unreachable. I would argue that such an attack is worse than moving to a different guard, but that attack could also be mitigated by just having two guards instead of one (since it is harder to keep pairs of guards offline simultaneously during a search for such a confirmation).
So I accept the NACK of the patch (and the second commit in #25705 (moved)), but I think we should not forget what this decision means wrt DoS and confirmation.
(1) It is really surprising to me that s7r could have been experiencing this bug (as currently described) for 7 hours. I think it was probably some other bug if it was really that long. For example, one of the ones where we mark things down and stop trying them, or one of the ones like #21969 (moved). asn says he only looked at a tiny fraction of the logs of the 7 hours, so let's be careful jumping to conclusions about what bug (or combination of bugs) he actually experienced.
(2) If a relay sends you a destroy cell with reason resourcelimit, it means that the relay has so many create cells on its queue that it thinks it won't get to this one in the next 1.75 seconds (MaxOnionQueueDelay default). So that's some real overload right there -- especially since even if you send another one and you don't get a destroy back, it means you squeaked into the queue, but you still have all those other creates ahead of you.
(2b) Do we have any reason to believe that the calculation in have_room_for_onionskin() is at all accurate? That is, are we sometimes sending this response when there are only 0.25 seconds worth of create cells in our queue? Or are we sometimes not sending them even though there are 5 seconds of cells queued?
(3) It would be nice to find a way for the dir auths to scale back the consensus weights of relays that are overloaded like this. That is, it would sure be swell if we could make this something that the dir auths solve for all the users, not something that each user has to encounter and then adapt to. But while I see why we want that, we should be realistic and realize that we won't get it: the dir auths act on a time schedule of hours, so they will catch perennially overloaded relays (say, relays that genuinely have a wildly wrong weighting or are just simply broken), but they won't be able to catch transient hotspots (including hotspots induced by bad people).
(4) I think we really need to figure out how often this happens in practice. That means scanning relays over time. Now, it happens that pastly's sbws might be able to collect this data for us. Also, the torperfs and onionperfs of the world could have this data already too, if they collect it. Do they? Noticing it in sbws has the slight added advantage that if we can figure out how to use it in computing weights, it's all ready to be used.
(5) I would be a fan of a feature where we track the destroy-resource-limit responses we receive over time, and if there have been (say) 30 different seconds recently where we got at least one destroy-resource-limit, and none of our attempts worked, we call the guard down. We shouldn't call it down in response to just one hotspot though (e.g. "I sent twenty create cells and I got twenty destroy-resource-limit responses"), since they're correlated with each other, that is, if you got one then it's not surprising that you'll get a second right after. And we might want to retry a guard that we mark down this way sooner than the 30-minute-later default from prop271. (A rough sketch of this bookkeeping appears after point (8) below.)
(6) I agree with asn that making it too easy to force a client to rotate guards is scary. The pool of 60-or-so guards from prop271 is a huge pool, and the only way to use that design securely imo is to have it be the case that some of those 60 guards are very hard to push clients away from.
(7) I agree with Mike that the confirmation attack ("send a bunch of create cells to each guard one at a time and see when your target onion service stops responding") is worrisome. But would a bandwidth congestion attack work there too? I guess it would be more expensive to pull off with the same level of reliability.
(7b) In a two guard design, I wonder if we should be even more reluctant to abandon a guard due to a transient problem like this. After all, if we do abandon one, we're increasing our surface area past two. And if we don't, in theory we still have one that's working.
(8) Remember CREATE_FAST? If your guard is otherwise fine but it's too busy to process your create cell... and you were about to do something foolish to your anonymity like move to another guard or go offline in response... hm. :)
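Referring back to point (5): here is a rough, self-contained sketch of the kind of bookkeeping it implies, with the 30-distinct-seconds threshold taken from that point; the names and the reset-on-success rule are hypothetical, not existing Tor code.
{{{
/* Hypothetical bookkeeping for point (5): count *distinct* seconds in
 * which we saw at least one RESOURCELIMIT DESTROY from a guard, so that
 * a correlated burst within one second only counts once.  Reset on any
 * successful circuit through the guard. */
#include <time.h>

#define OVERLOAD_SECONDS_THRESHOLD 30 /* the "30 different seconds" above */

typedef struct guard_overload_tracker_t {
  time_t last_counted_second;
  int distinct_seconds_with_destroy;
} guard_overload_tracker_t;

/* Call on each RESOURCELIMIT DESTROY from the guard; returns 1 when we
 * should treat the guard as down. */
static int
note_resourcelimit_destroy(guard_overload_tracker_t *t, time_t now)
{
  if (now != t->last_counted_second) {
    t->last_counted_second = now;
    t->distinct_seconds_with_destroy++;
  }
  return t->distinct_seconds_with_destroy >= OVERLOAD_SECONDS_THRESHOLD;
}

/* Call when a circuit through this guard succeeds. */
static void
note_guard_success(guard_overload_tracker_t *t)
{
  t->last_counted_second = 0;
  t->distinct_seconds_with_destroy = 0;
}
}}}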