In legacy/trac#15463 (moved), we're seeing an effective denial of service against a HS with a flood of introductions. The service falls apart trying to build rendezvous circuits, resulting in 100% CPU usage, many failed circuits, and impact on the guard.
We should consider dropping INTRODUCE2 cells when the HS is under too much load to build rendezvous circuits successfully. It's much better if the HS response in this situation is predictable, instead of hammering at the guard until something falls down.
One option is to add a HSMaxConnectionRate(?) option defining the number of INTRODUCE2 we would accept per 10(?) minutes, maybe with some bursting behavior. It's unclear what a useful default value would be.
We could try to use a heuristic based on when rend circuits start failing, but it's not obvious to me how that would work.
We should probably queue INTRODUCE2 cells and act on them as best we can. If the queue grows too big (i.e., we are under DoS), we should drop enough cells that we (and our guard) can handle the load.
This seems like queuing theory stuff, and specifically active queue management. Yawning suggested looking into algorithms like Stochastic Fair Blue and CoDel.
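To make the active queue management idea concrete, here is a rough, hypothetical C sketch of a CoDel-style drop decision for the INTRODUCE2 queue. The struct, field names, and the 5 ms / 100 ms constants are illustrative only (they are CoDel's usual defaults, not values tuned for this use case), and real CoDel also adjusts how aggressively it drops once it enters the dropping state, which is omitted here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants: CoDel's usual target delay and interval. */
#define INTRO2_CODEL_TARGET_MSEC   5    /* acceptable queueing delay */
#define INTRO2_CODEL_INTERVAL_MSEC 100  /* how long delay must persist */

/* Hypothetical per-service queue state; not an existing tor structure. */
typedef struct intro2_queue_state_t {
  uint64_t first_above_time_msec; /* 0 = delay is currently below target */
} intro2_queue_state_t;

/* Decide whether to drop the INTRODUCE2 cell at the head of the queue.
 * `enqueued_at_msec` is when that cell was queued; `now_msec` is the
 * current monotonic time. */
static bool
intro2_queue_should_drop(intro2_queue_state_t *q,
                         uint64_t enqueued_at_msec, uint64_t now_msec)
{
  uint64_t sojourn_msec = now_msec - enqueued_at_msec;

  if (sojourn_msec < INTRO2_CODEL_TARGET_MSEC) {
    /* Queueing delay is acceptable again: leave the dropping state. */
    q->first_above_time_msec = 0;
    return false;
  }
  if (q->first_above_time_msec == 0) {
    /* Delay just crossed the target: start the grace interval. */
    q->first_above_time_msec = now_msec;
    return false;
  }
  /* Drop only once the delay has stayed above target for a full interval. */
  return now_msec - q->first_above_time_msec >= INTRO2_CODEL_INTERVAL_MSEC;
}
```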
I'd actually like to see some exploration of initial throttling, dropping, or queueing at the intro point as well. That was originally meant to be the first line of defense here.
(In a related design, the hs might consider which intro points the intro2 cells are arriving from, and if they're all arriving from one intro point, take that into account.)
It is impossible that we will fix all 226 currently open 0.2.8 tickets before 0.2.8 releases. Time to move some out. This is my second pass through the "new" tickets, looking for things to move to 0.2.9.
Trac: Milestone: Tor: 0.2.8.x-final to Tor: 0.2.9.x-final
Remove the SponsorU status from these items, which we already decided to defer from 0.2.9. Add the SponsorU-deferred tag instead in case we ever want to remember which ones these were.
> I'd actually like to see some exploration of initial throttling, dropping, or queueing at the intro point as well. That was originally meant to be the first line of defense here.
Here's my concrete proposal on this one: the intro point should see if the package window for the intro circuit is empty, and if so, it should nack the intro1 cell. That way there are at most 1000 intro2 cells in flight at once from that intro point.
This design is reasonable because it takes a long while for an onion service to process 1000 intro2 cells, so if we queue later ones and send them 'eventually', they're going to arrive much later, and the client will likely have timed out and moved on from that rendezvous point. So we're not harming legitimate clients who end up in this situation, because the current behavior is already harming them plenty.
The benefits are that (a) the onion service doesn't receive the excess intro2 cells that it wasn't going to be able to rendezvous with anyway, (b) clients get a much faster feedback that things aren't going to work so they can move to another intro point, and (c) when a DoS stops, the pain stops soon after: there isn't a huge queue of waiting intro2 cells that have to slowly drain, for no value.
We could imagine an extension on this idea, where the intro point silently drops the excess intro1 cells, rather than explicitly nacking them. This variant will force the client to time out rather than immediately try the next intro point, thus slowing down attacks by clients that follow the protocol. (Modified clients could still use a smaller timeout, or not even care whether they get a response.)
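A minimal sketch of that decision at the intro point, assuming it can look at the service-side circuit's package window (the circwindow starts at 1000 cells); the struct, enum, and function below are illustrative stand-ins rather than existing tor code.

```c
#include <stdbool.h>

/* Minimal stand-in for the intro point's view of the circuit to the
 * service; tor's real circuit state is much richer. */
typedef struct service_circ_t {
  int package_window; /* cells we may still send toward the service */
} service_circ_t;

typedef enum {
  INTRO1_FORWARD,     /* relay it to the service as an INTRODUCE2 */
  INTRO1_NACK,        /* answer the client with a failure ack right away */
  INTRO1_SILENT_DROP, /* say nothing and let the client time out */
} intro1_action_t;

/* Decide what to do with an incoming INTRODUCE1.  `prefer_silent_drop`
 * selects the variant that withholds the nack to slow down
 * protocol-following clients. */
static intro1_action_t
intro1_decide(const service_circ_t *circ, bool prefer_silent_drop)
{
  if (circ->package_window <= 0) {
    /* A full window of INTRODUCE2 cells is already in flight toward the
     * service; anything more would only sit in a queue. */
    return prefer_silent_drop ? INTRO1_SILENT_DROP : INTRO1_NACK;
  }
  return INTRO1_FORWARD;
}
```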
Another idea I was considering here, but ultimately abandoned as more complex than we need, was to somehow timestamp the intro1 cell when it gets received at the intro point, which would allow the onion service to examine how many seconds have passed and discard it if it was received more than n seconds ago. That would essentially mean that we have n seconds of valid intro2 cells in flight, rather than at-most-n circwindows of intro2 cells in flight. This approach would handle congestion that happens inside the network (between the intro point and the service), in that if it takes a long time for the intro2 cell to make it from the intro point to the onion service, it's less likely that the client is still around and waiting for the connect-back.
But how exactly to do the timestamp, and how and whether we would need to synchronize clocks, made this too clunky an idea.
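For what it's worth, the discard check itself would have been small; the threshold and the stamped-time argument below are hypothetical, and the clock-synchronization problem that made the idea clunky is not addressed.

```c
#include <stdbool.h>
#include <stdint.h>

#define INTRO2_MAX_AGE_SEC 30 /* hypothetical "n seconds" threshold */

/* Should the service discard this cell?  `stamped_at_sec` is the
 * (hypothetical) time the intro point says it received the intro1 cell. */
static bool
intro2_is_too_old(uint64_t stamped_at_sec, uint64_t now_sec)
{
  return now_sec > stamped_at_sec &&
         now_sec - stamped_at_sec > INTRO2_MAX_AGE_SEC;
}
```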
Because legacy/trac#30440 (moved) won't be a mature thing in the network for many years to come, we can only use the "package_window" proposal once it is.
So until then, we'll use a token bucket system, add knobs in the consensus (like the dos.c subsystem), and go on from there. Not sure yet how we will come up with the values, but they need to be large enough that they don't affect a legitimately busy HS.
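Roughly, the per-service check could look like the sketch below. The rate and burst constants are placeholders for whatever consensus parameters get chosen, and a real patch would presumably reuse tor's existing token-bucket helpers rather than hand-rolling one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder defaults; in practice these would come from consensus
 * parameters so they can be tuned network-wide, like the dos.c knobs. */
#define INTRO2_RATE_PER_SEC 25   /* sustained INTRODUCE2 cells per second */
#define INTRO2_BURST        200  /* short-term burst allowance */

typedef struct intro2_bucket_t {
  uint32_t tokens;          /* tokens currently available */
  uint64_t last_refill_sec; /* when we last added tokens */
} intro2_bucket_t;

/* Refill at the configured rate (capped at the burst), then spend one
 * token per INTRODUCE2.  Returns false when the cell should be dropped. */
static bool
intro2_bucket_allow(intro2_bucket_t *b, uint64_t now_sec)
{
  if (now_sec > b->last_refill_sec) {
    uint64_t refill = (now_sec - b->last_refill_sec) * INTRO2_RATE_PER_SEC;
    uint64_t total = b->tokens + refill;
    b->tokens = (uint32_t)(total > INTRO2_BURST ? INTRO2_BURST : total);
    b->last_refill_sec = now_sec;
  }
  if (b->tokens == 0)
    return false;
  b->tokens--;
  return true;
}
```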
I did a review on the code without considering the higher-level design here. I will think more about the numbers and such and reply to the tor-dev thread today or tomorrow.
Thanks for the updates David! Only a single nit remains on the GH (and maybe also open the tokenbucket ticket so that we don't forget?).
As a further thing: I lost track of the experimental results of this ticket when I went to AllHands. I now don't remember exactly how this ticket affects (a) the health of the network and (b) the availability of the service. Any chance you could update us on these two things on the tor-dev mailing list? I think it would be great to have this documented so that we know exactly what we are doing by merging this patch.
Marking as needs_revision for these last bits of action.
> As a further thing: I lost track of the experimental results of this ticket when I went to AllHands. I now don't remember exactly how this ticket affects (a) the health of the network and (b) the availability of the service. Any chance you could update us on these two things on the tor-dev mailing list? I think it would be great to have this documented so that we know exactly what we are doing by merging this patch.
Yes I can do this!
With legacy/trac#30924 (moved), we'll have a more complete feature, and at that point we should probably write a blog post about this entire new defense and how to leverage it.