There are classes of relay cells that don't count in the circuit windows. Specifically, only data relay cells count in the circuit windows.
So for example, a malicious website could arrange for the client to open 2200 streams to it, and then the website could close them all at once. 2200 relay end cells would work their way back towards the client. The #9063 (moved) fix would get upset and kill the circuit.
A) A website just causes Alice to launch 2200 stream connections. Then she gets 2200 connected cells heading her way.
B) If Alice is using the leaky pipe design and has two circ-window-maxes heading her way, but she's also uploading stuff and triggers a sendme cell to return to her, that brings us to 2001.
Actually, this is all clients. Bob pointed out that it is their Guards who have to upgrade for them to be vulnerable, not the clients themselves.
Pretty bad stuff.
Perhaps we should remove #9063 (moved) and save it for Guards who actually experience OOMing?
Trac: Summary: #9063 (moved) enables Guard discovery of 0.2.4.x clients in about an hour by websites to #9063 (moved) enables Guard discovery in about an hour by websites Priority: major to critical
Hm. Targeted OOM, or guard discovery. Not a great choice.
We might need to do the 0.2.5 version of a #9063 (moved) fix that we were talking about, where we define a maximum space allocatable for cells, and do some kind of OOM killer on clogged-looking circuits if we run out of space for new queued cells. Not great, and not delightful to contemplate for an 0.2.4 backport, but the OOM situation isn't great either.
This doesn't make discovering a client's entry guards significantly easier. It does make failing a client's circuits from the exit/destination end significantly easier, and as soon as the path-bias detector is turned on, that becomes a full remote user-tracing attack.
Here are a few sketch oom-killer algorithms. Let's try to find flaws in them and find something better.
Algorithm 1.
have a max number of allocated cells configured.
when that number is reached, consider all circuits and kill those with the longest queues until we are at X% of the limit.
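Algorithm 1 can be sketched as a small model. This is an illustrative Python sketch, not Tor's actual C implementation; `MAX_CELLS`, `TARGET_FRACTION`, and the `Circuit` class are made-up names for the configured cap, the X% reclaim target, and a circuit's queue state:

```python
# Sketch of Algorithm 1: when a global cell-allocation cap is hit, kill the
# circuits with the longest queues until usage drops to X% of the cap.
# All names and numbers here are illustrative assumptions, not Tor's.

MAX_CELLS = 100_000        # hypothetical configured cap on allocated cells
TARGET_FRACTION = 0.90     # X%: reclaim until we are back at this fraction

class Circuit:
    def __init__(self, circ_id, queue_len):
        self.circ_id = circ_id
        self.queue_len = queue_len
        self.killed = False

def oom_kill_longest_queues(circuits):
    """Return the ids of circuits killed to get back under the target."""
    total = sum(c.queue_len for c in circuits)
    if total < MAX_CELLS:
        return []  # cap not reached; do nothing
    target = MAX_CELLS * TARGET_FRACTION
    victims = []
    # Consider circuits from longest queue to shortest.
    for circ in sorted(circuits, key=lambda c: c.queue_len, reverse=True):
        if total <= target:
            break
        circ.killed = True
        total -= circ.queue_len
        victims.append(circ.circ_id)
    return victims
```

For example, with queues of 60000, 30000, and 15000 cells (total 105000, over the cap), killing only the 60000-cell circuit already brings us under 90% of the cap, so the other two survive.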
Algorithm 2.
as in alg 1 above.
when we hit the limit, consider the connections with the largest number of queued cells on circuits aiming for them.
kill those connections until we are at X% of the limit.
Algorithm 3.
1, 2) as in alg 2 above.
3) kill the biggest circuits on those connections until we are at X% of our limit.
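Algorithms 2 and 3 differ from Algorithm 1 in grouping queued cells by target connection first. The following is an illustrative Python sketch of Algorithm 3 (Algorithm 2 would kill whole connections at the marked step instead); the constants and tuple representation are assumptions for the example:

```python
# Sketch of Algorithm 3: rank connections by the total cells queued toward
# them, then kill the biggest circuits on the busiest connections until we
# are back at X% of the cap. Names and numbers are illustrative.
from collections import defaultdict

MAX_CELLS = 100_000
TARGET_FRACTION = 0.90

def oom_kill_on_busiest_conns(circuits):
    """circuits: list of (circ_id, conn_id, queue_len). Returns killed ids."""
    total = sum(q for _, _, q in circuits)
    if total < MAX_CELLS:
        return []
    target = MAX_CELLS * TARGET_FRACTION
    # Step 2: total queued cells per target connection.
    per_conn = defaultdict(int)
    for _, conn_id, q in circuits:
        per_conn[conn_id] += q
    victims = []
    for conn_id in sorted(per_conn, key=per_conn.get, reverse=True):
        # Step 3: on this connection, kill biggest circuits first.
        # (Algorithm 2 would instead close the whole connection here.)
        on_conn = sorted((c for c in circuits if c[1] == conn_id),
                         key=lambda c: c[2], reverse=True)
        for circ_id, _, q in on_conn:
            if total <= target:
                return victims
            total -= q
            victims.append(circ_id)
    return victims
```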
Algorithms 1T, 2T, 3T:
As algorithms 1, 2, 3, but instead of longest queue, look at longest flush time, where flush time is the queue length divided by the connection's observed bandwidth.
As algorithms 1T, 2T, and 3T, but define a queue's flush time as some function that accounts for the number of other circuits targeting the same connection. If we were using round robin, I would say: N/BW, where BW is the connection's observed bandwidth and N is the sum of all queue lengths on that connection, each queue length clipped to be no longer than the queue under consideration.
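The two flush-time metrics can be written out as small functions. This is an illustrative sketch; the function names are made up, and bandwidth is taken in cells per second:

```python
# Sketch of the "flush time" metrics: the simple 1T/2T/3T version, and the
# round-robin-aware refinement. Names and units are illustrative.

def flush_time_simple(queue_len, conn_bandwidth):
    # Algorithms 1T/2T/3T: flush time = queue length / observed bandwidth.
    return queue_len / conn_bandwidth

def flush_time_round_robin(queue_len, all_queue_lens_on_conn, conn_bandwidth):
    # Refinement: under round-robin scheduling, each other circuit on the
    # same connection can delay this queue by at most queue_len of its own
    # cells, so clip every queue (including this one) to queue_len and sum.
    n = sum(min(q, queue_len) for q in all_queue_lens_on_conn)
    return n / conn_bandwidth
```

For a 100-cell queue sharing a 100-cell/s connection with queues of 500 and 50 cells, the clipped sum is 100 + 100 + 50 = 250 cells, i.e. a 2.5 s flush time, rather than the 6.5 s an unclipped sum would give.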
Instead of closing circuits or connections in response to OOM, you could check whether the relay has hit its rate limit, and if it hasn't, send a bunch of cells along the circuit/connection. (Perhaps the next relay will have more room for them.)
"Maximum number of cells in entrys's circuit queue = 8 * (1000_data + 10_sendmecirc + 20_sendme_stream + N * (1_connected + 1_end)), N is number of maximum allowed streams for client"
"so 10k is actual number for client with 100 streams limit"
(quoting from irc)
If we disable the leaky pipe feature, 2k cells is enough, with 100 stream limit.
Of course, this isn't counting drop cells, which we might want to add in at some point to help with website fingerprinting.
> As algorithms 1T, 2T, and 3T, but define a queue's flush time as some function that accounts for the number of other circuits targeting the same connection. If we were using round robin, I would say: N/BW, where BW is the connection's observed bandwidth and N is the sum of all queue lengths on that connection, each queue length clipped to be no longer than the queue under consideration.
Is there any reason not to simplify this down to the circuit level only? Anything based on connections is easily subverted by a sybil attack. And you can't consider the time-independent queue length alone because of the sybil attack.
The goal is to drive up the cost of the attack while making sure you are not killing an honest client's circuit. If you consider a function of the queue length and the waiting time of the longest waiting cell for each circuit, then a malicious client will have to read cells at a rate at least as fast as the slowest honest client on every sybil in order to cause the relay to target the honest client's circuit. It can be shown that the longer the attacker's queue becomes, the faster it must read from it to avoid selection because it needs to keep the waiting times of the cells in the malicious circuits lower than those of all honest circuits.
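The victim-selection rule described above can be modeled concretely. This is an illustrative sketch under stated assumptions: the score is taken as queue length times the waiting time of the oldest queued cell (any function monotone in both would illustrate the same point), and the names are invented for the example:

```python
# Sketch: select the OOM victim by a function of queue length and the waiting
# time of the circuit's oldest queued cell. An attacker must then keep reading
# from every sybil circuit to keep its cells' waiting times below those of
# honest circuits. Score function and names are illustrative assumptions.

def victim_score(queue_len, oldest_cell_enqueued_at, now):
    waiting = now - oldest_cell_enqueued_at
    return queue_len * waiting  # one possible monotone combination

def pick_victim(circuits, now):
    """circuits: list of (circ_id, queue_len, oldest_cell_enqueued_at)."""
    return max(circuits, key=lambda c: victim_score(c[1], c[2], now))[0]
```

For instance, a 500-cell queue whose oldest cell is 1 s old (an attacker reading quickly) scores 500, while a 400-cell queue whose oldest cell has waited 50 s scores 20000 and is selected, which is exactly the pressure on the attacker described above.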
I guess authenticated SENDMEs, where the exit inserts a 1-byte nonce into every 100th cell and won't increase the package window unless the SENDME contains the nonce, are also off the table, since like the queue length limit defense, they don't handle non-data cells.
> Instead of closing circuits or connections in response to OOM, you could check whether the relay has hit its rate limit, and if it hasn't, send a bunch of cells along the circuit/connection. (Perhaps the next relay will have more room for them.)
That sounds plausible as a reliability improvement, but unless I misunderstand, it won't actually prevent any DOS here, since the recipient could just stop reading and thereby make the sender's TCP implementation refuse to write any more.
How long are cell queues for normal clients' circuits and connections in practice?
> If we disable the leaky pipe feature, 2k cells is enough, with 100 stream limit.
Is there any such 100 stream limit? I'm not seeing it in the code today. So if we want to protect existing clients from this issue, 2k is too low. We'd need to write a patch to limit that, which would create another way for a hostile website to make a client open a ton of circuits. (We could mitigate that a little by applying a limit only to the number of circuits for which we've sent a BEGIN but not gotten a CONNECTED, I guess, and delaying pending streams.)
So if I'm not wrong about the code as it stands, that implies that N=65535, so the magic number is a hefty 1056816 (or merely 264204 if you don't believe in leaky-pipe). Not so good.
Or am I missing a real 100-stream limit somewhere?
> I guess authenticated SENDMEs, where the exit inserts a 1-byte nonce into every 100th cell and won't increase the package window unless the SENDME contains the nonce, are also off the table, since like the queue length limit defense, they don't handle non-data cells.
I imagine it's not a simple fix, but what are the other problems with making all classes of relay cells count against the circuit package window?
The biggest problems I know with each of these are the migration issues. We can't require that clients and exits use them until all clients and exits support them. That doesn't make them complete non-starters, but it does keep us from deploying them any time, say, in the next two weeks. :)
> Is there any reason not to simplify this down to the circuit level only? Anything based on connections is easily subverted by a sybil attack. And you can't consider the time-independent queue length alone because of the sybil attack.
What you say seems plausible. Though it might not be too bad to do a time-independent queue length as "good enough for now" in 0.2.4 and maybe 0.2.3, then work on the better thing in 0.2.5, targeting a possible backport to 0.2.4 if the better thing turns out to be not too hard. After all, if the #9063 (moved) solution seemed "good enough as a temporary measure" (barring this issue) then surely a queue-length-based solution is not worse.
I think this here (algorithm 1 in 0.2.3 and 0.2.4; work on better stuff for 0.2.5 for a possible backport) is my current preferred approach, assuming there aren't flaws in it beyond what I know. Finding the exactly-right approach for 0.2.5 seems like it might take a bunch of discussion and analysis.
> The goal is to drive up the cost of the attack while making sure you are not killing an honest client's circuit. If you consider a function of the queue length and the waiting time of the longest waiting cell for each circuit, then a malicious client will have to read cells at a rate at least as fast as the slowest honest client on every sybil in order to cause the relay to target the honest client's circuit. It can be shown that the longer the attacker's queue becomes, the faster it must read from it to avoid selection because it needs to keep the waiting times of the cells in the malicious circuits lower than those of all honest circuits.
You don't want a function of the longest waiting cell for each circuit, I think: you want a function of the expected time to drain the queue.[*] That's why I was looking at connection flush rates -- the speed at which the connection is draining, and the lengths of the other circuit queues on the connection, determine how long it will take to drain the queue.
[*] Why look at the expected drain time and not age of the longest waiting cell? Imagine a slow connection where a bunch of legit circuits have had modestly long queues for a while, and a new connection has gotten a really egregious queue really quickly. The ones that have been around longer seem like they could be the ones to get killed, which isn't so great.
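The footnote's scenario can be put in concrete (illustrative, made-up) numbers: a long-lived honest circuit with a modest queue on a slow connection versus a brand-new egregious queue on a fast one.

```python
# Hypothetical numbers for the footnote's scenario: oldest-cell age fingers
# the long-lived honest circuit, while expected drain time fingers the
# circuit that is actually clogging things. All figures are illustrative.

# (queue_len_cells, conn_bandwidth_cells_per_sec, oldest_cell_age_sec)
honest_old = (200, 50, 30.0)    # modest queue, slow conn, around a while
new_flood  = (5000, 500, 2.0)   # egregious queue built up very quickly

def drain_time(queue_len, bw, _age):
    # Expected time to flush the queue at the connection's observed rate.
    return queue_len / bw

def oldest_age(_queue_len, _bw, age):
    # Age of the longest-waiting cell.
    return age
```

By oldest-cell age the honest circuit looks worse (30 s vs 2 s) and would be killed; by expected drain time the flooding circuit looks worse (10 s vs 4 s), which is the outcome we want.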
I'm going to suggest we keep this open for now to discuss what fix we want to do in 0.2.3/0.2.4.
Trac: Milestone: Tor: 0.2.3.x-final to Tor: 0.2.4.x-final Priority: critical to major Resolution: fixed to N/A Status: closed to reopened Component: Tor to Thandy
> I think this here (algorithm 1 in 0.2.3 and 0.2.4; work on better stuff for 0.2.5 for a possible backport) is my current preferred approach, assuming there aren't flaws in it beyond what I know. Finding the exactly-right approach for 0.2.5 seems like it might take a bunch of discussion and analysis.
See #9063 (moved) for my attempt at hacking that together. It needs review.
"idea with streams limiting for #9072 (moved) is wrong too. idea with limiting half open streams is better but still fragile. no way to limit queue length with current protocol. and don't forget about many relay_truncate, relay_truncated, that can be used actively with leaky pipe topology."
I asked him: what if we made a patch that was a torrc option, so we at least had something in a release we could tell people "Turn this on if you're a guard and experience ooms?" how should we implement that option? is there a least bad choice?
"limiting streams to limit queue length to protect against killer website but protect against oom is kludge. need to think about torrc option."
> Here are a few sketch oom-killer algorithms. Let's try to find flaws in them and find something better.
> Algorithm 1.
> have a max number of allocated cells configured.
> when that number is reached, consider all circuits and kill those with the longest queues until we are at X% of the limit.
Making sure I understand this correctly: this is better than the max queue length version because the maximum is controlled by the amount of memory available globally, so the queue fill has to be much, much larger to force a circuit kill needed for the potential exploit?
> Making sure I understand this correctly: this is better than the max queue length version because the maximum is controlled by the amount of memory available globally, so the queue fill has to be much, much larger to force a circuit kill needed for the potential exploit?
Right. With all of the algorithms I outlined above, we only kill circuits once we're really about to be out of memory. So the tricks described in the original ticket here (and in the first comment) aren't sufficient to force a circuit kill.