There are classes of relay cells that don't count in the circuit windows. Specifically, only data relay cells count in the circuit windows.
So for example, a malicious website could arrange for the client to open 2200 streams to it, and then the website could close them all at once. 2200 relay end cells would work their way back towards the client. The #9063 (moved) fix would get upset and kill the circuit.
A) A website just causes Alice to launch 2200 stream connections. Then she gets 2200 connected cells heading her way.
B) If Alice is using the leaky pipe design and has two circ-window-maxes heading her way, but she's also uploading stuff and triggers a sendme cell to return to her, that brings us to 2001.
Actually, this is all clients. Bob pointed out that it is their Guards who have to upgrade for them to be vulnerable, not the clients themselves.
Pretty bad stuff.
Perhaps we should remove #9063 (moved) and save it for Guards who actually experience OOMing?
Trac: Summary: #9063 (moved) enables Guard discovery of 0.2.4.x clients in about an hour by websites to #9063 (moved) enables Guard discovery in about an hour by websites Priority: major to critical
Hm. Targeted OOM, or guard discovery. Not a great choice.
We might need to do the 0.2.5 version of a #9063 (moved) fix that we were talking about, where we define a maximum space allocatable for cells, and do some kind of OOM killer on clogged-looking circuits if we run out of space for new queued cells. Not great, and not delightful to contemplate for an 0.2.4 backport, but the OOM situation isn't great either.
This doesn't make discovering a client's entry guards significantly easier. It does make failing a client's circuits from the exit/destination end significantly easier, and as soon as the path-bias detector is turned on, that becomes a full remote user-tracing attack.
Here are a few sketch oom-killer algorithms. Let's try to find flaws in them and find something better.
Algorithm 1.
have a max number of allocated cells configured.
when that number is reached, consider all circuits and kill those with the longest queues until we are at X% of the limit.
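Algorithm 1 can be sketched as a small model. This is an illustrative Python sketch, not Tor's actual C implementation; `MAX_CELLS`, `TARGET_FRACTION`, and the `Circuit` class are made-up names for the configured cap, the X% reclaim target, and a circuit's queue state:

```python
# Sketch of Algorithm 1: when a global cell-allocation cap is hit, kill the
# circuits with the longest queues until usage drops to X% of the cap.
# All names and numbers here are illustrative assumptions, not Tor's.

MAX_CELLS = 100_000        # hypothetical configured cap on allocated cells
TARGET_FRACTION = 0.90     # X%: reclaim until we are back at this fraction

class Circuit:
    def __init__(self, circ_id, queue_len):
        self.circ_id = circ_id
        self.queue_len = queue_len
        self.killed = False

def oom_kill_longest_queues(circuits):
    """Return the ids of circuits killed to get back under the target."""
    total = sum(c.queue_len for c in circuits)
    if total < MAX_CELLS:
        return []  # cap not reached; do nothing
    target = MAX_CELLS * TARGET_FRACTION
    victims = []
    # Consider circuits from longest queue to shortest.
    for circ in sorted(circuits, key=lambda c: c.queue_len, reverse=True):
        if total <= target:
            break
        circ.killed = True
        total -= circ.queue_len
        victims.append(circ.circ_id)
    return victims
```

For example, with queues of 60000, 30000, and 15000 cells (total 105000, over the cap), killing only the 60000-cell circuit already brings us under 90% of the cap, so the other two survive.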
Algorithm 2.
as in alg 1 above.
when we hit the limit, consider the connections with the largest number of queued cells on circuits aiming for them.
kill those connections until we are at X% of the limit.
Algorithm 3.
1, 2) as in alg 2 above.
3) kill the biggest circuits on those connections until we are at X% of our limit.
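Algorithms 2 and 3 differ from Algorithm 1 in grouping queued cells by target connection first. The following is an illustrative Python sketch of Algorithm 3 (Algorithm 2 would kill whole connections at the marked step instead); the constants and tuple representation are assumptions for the example:

```python
# Sketch of Algorithm 3: rank connections by the total cells queued toward
# them, then kill the biggest circuits on the busiest connections until we
# are back at X% of the cap. Names and numbers are illustrative.
from collections import defaultdict

MAX_CELLS = 100_000
TARGET_FRACTION = 0.90

def oom_kill_on_busiest_conns(circuits):
    """circuits: list of (circ_id, conn_id, queue_len). Returns killed ids."""
    total = sum(q for _, _, q in circuits)
    if total < MAX_CELLS:
        return []
    target = MAX_CELLS * TARGET_FRACTION
    # Step 2: total queued cells per target connection.
    per_conn = defaultdict(int)
    for _, conn_id, q in circuits:
        per_conn[conn_id] += q
    victims = []
    for conn_id in sorted(per_conn, key=per_conn.get, reverse=True):
        # Step 3: on this connection, kill biggest circuits first.
        # (Algorithm 2 would instead close the whole connection here.)
        on_conn = sorted((c for c in circuits if c[1] == conn_id),
                         key=lambda c: c[2], reverse=True)
        for circ_id, _, q in on_conn:
            if total <= target:
                return victims
            total -= q
            victims.append(circ_id)
    return victims
```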
Algorithms 1T, 2T, 3T:
As algorithms 1, 2, 3, but instead of longest queue, look at longest flush time, where flush time is the queue length divided by the connection's observed bandwidth.
As algorithms 1T, 2T, and 3T, but define a queue's flush time as some function that accounts for the number of other circuits targeting the same connection. If we were using round robin, I would say: N/BW, where BW is the connection's observed bandwidth and N is the sum of all queue lengths on that connection, each queue length clipped to be no longer than the queue under consideration.
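The two flush-time metrics can be written out as small functions. This is an illustrative sketch; the function names are made up, and bandwidth is taken in cells per second:

```python
# Sketch of the "flush time" metrics: the simple 1T/2T/3T version, and the
# round-robin-aware refinement. Names and units are illustrative.

def flush_time_simple(queue_len, conn_bandwidth):
    # Algorithms 1T/2T/3T: flush time = queue length / observed bandwidth.
    return queue_len / conn_bandwidth

def flush_time_round_robin(queue_len, all_queue_lens_on_conn, conn_bandwidth):
    # Refinement: under round-robin scheduling, each other circuit on the
    # same connection can delay this queue by at most queue_len of its own
    # cells, so clip every queue (including this one) to queue_len and sum.
    n = sum(min(q, queue_len) for q in all_queue_lens_on_conn)
    return n / conn_bandwidth
```

For a 100-cell queue sharing a 100-cell/s connection with queues of 500 and 50 cells, the clipped sum is 100 + 100 + 50 = 250 cells, i.e. a 2.5 s flush time, rather than the 6.5 s an unclipped sum would give.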
Instead of closing circuits or connections in response to OOM, you could check whether the relay has hit its rate limit, and if it hasn't, send a bunch of cells along the circuit/connection. (Perhaps the next relay will have more room for them.)
"Maximum number of cells in entrys's circuit queue = 8 * (1000_data + 10_sendmecirc + 20_sendme_stream + N * (1_connected + 1_end)), N is number of maximum allowed streams for client"
"so 10k is actual number for client with 100 streams limit"
(quoting from irc)
If we disable the leaky pipe feature, 2k cells is enough, with 100 stream limit.
Of course, this isn't counting drop cells, which we might want to add in at some point to help with website fingerprinting.
> As algorithms 1T, 2T, and 3T, but define a queue's flush time as some function that accounts for the number of other circuits targeting the same connection. If we were using round robin, I would say: N/BW, where BW is the connection's observed bandwidth and N is the sum of all queue lengths on that connection, each queue length clipped to be no longer than the queue under consideration.
Is there any reason not to simplify this down to the circuit level only? Anything based on connections is easily subverted by a sybil attack. And you can't consider the time-independent queue length alone because of the sybil attack.
The goal is to drive up the cost of the attack while making sure you are not killing an honest client's circuit. If you consider a function of the queue length and the waiting time of the longest waiting cell for each circuit, then a malicious client will have to read cells at a rate at least as fast as the slowest honest client on every sybil in order to cause the relay to target the honest client's circuit. It can be shown that the longer the attacker's queue becomes, the faster it must read from it to avoid selection because it needs to keep the waiting times of the cells in the malicious circuits lower than those of all honest circuits.
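The victim-selection rule described above can be modeled concretely. This is an illustrative sketch under stated assumptions: the score is taken as queue length times the waiting time of the oldest queued cell (any function monotone in both would illustrate the same point), and the names are invented for the example:

```python
# Sketch: select the OOM victim by a function of queue length and the waiting
# time of the circuit's oldest queued cell. An attacker must then keep reading
# from every sybil circuit to keep its cells' waiting times below those of
# honest circuits. Score function and names are illustrative assumptions.

def victim_score(queue_len, oldest_cell_enqueued_at, now):
    waiting = now - oldest_cell_enqueued_at
    return queue_len * waiting  # one possible monotone combination

def pick_victim(circuits, now):
    """circuits: list of (circ_id, queue_len, oldest_cell_enqueued_at)."""
    return max(circuits, key=lambda c: victim_score(c[1], c[2], now))[0]
```

For instance, a 500-cell queue whose oldest cell is 1 s old (an attacker reading quickly) scores 500, while a 400-cell queue whose oldest cell has waited 50 s scores 20000 and is selected, which is exactly the pressure on the attacker described above.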
I guess authenticated SENDMEs, where the exit inserts a 1-byte nonce into every 100th cell and won't increase the package window unless the SENDME contains the nonce, are also off the table, since like the queue length limit defense, they don't handle non-data cells.
> Instead of closing circuits or connections in response to OOM, you could check whether the relay has hit its rate limit, and if it hasn't, send a bunch of cells along the circuit/connection. (Perhaps the next relay will have more room for them.)
That sounds plausible as a reliability improvement, but unless I misunderstand, it won't actually prevent any DOS here, since the recipient could just stop reading and thereby make the sender's TCP implementation refuse to write any more.
How long are cell queues for normal clients' circuits and connections in practice?
> If we disable the leaky pipe feature, 2k cells is enough, with 100 stream limit.
Is there any such 100 stream limit? I'm not seeing it in the code today. So if we want to protect existing clients from this issue, 2k is too low. We'd need to write a patch to limit that, which would create another way for a hostile website to make a client open a ton of circuits. (We could mitigate that a little by applying a limit only to the number of circuits for which we've sent a BEGIN but not gotten a CONNECTED, I guess, and delaying pending streams.)
So if I'm not wrong about the code as it stands, that implies that N=65535, so the magic number is a hefty 1056816 (or merely 264204 if you don't believe in leaky-pipe). Not so good.
Or am I missing a real 100-stream limit somewhere?
> I guess authenticated SENDMEs, where the exit inserts a 1-byte nonce into every 100th cell and won't increase the package window unless the SENDME contains the nonce, are also off the table, since like the queue length limit defense, they don't handle non-data cells.
I imagine it's not a simple fix, but what are the other problems with making all classes of relay cells count against the circuit package window?
The biggest problems I know with each of these are the migration issues. We can't require that clients and exits use them until all clients and exits support them. That doesn't make them complete non-starters, but it does keep us from deploying them any time, say, in the next two weeks. :)
> Is there any reason not to simplify this down to the circuit level only? Anything based on connections is easily subverted by a sybil attack. And you can't consider the time-independent queue length alone because of the sybil attack.
What you say seems plausible. Though it might not be too bad to do a time-independent queue length as "good enough for now" in 0.2.4 and maybe 0.2.3, then work on the better thing in 0.2.5, targeting a possible backport to 0.2.4 if the better thing turns out to be not too hard. After all, if the #9063 (moved) solution seemed "good enough as a temporary measure" (barring this issue) then surely a queue-length-based solution is not worse.
I think this here (algorithm 1 in 0.2.3 and 0.2.4; work on better stuff for 0.2.5 for a possible backport) is my current preferred approach, assuming there aren't flaws in it beyond what I know. Finding the exactly-right approach for 0.2.5 seems like it might take a bunch of discussion and analysis.
> The goal is to drive up the cost of the attack while making sure you are not killing an honest client's circuit. If you consider a function of the queue length and the waiting time of the longest waiting cell for each circuit, then a malicious client will have to read cells at a rate at least as fast as the slowest honest client on every sybil in order to cause the relay to target the honest client's circuit. It can be shown that the longer the attacker's queue becomes, the faster it must read from it to avoid selection because it needs to keep the waiting times of the cells in the malicious circuits lower than those of all honest circuits.
You don't want a function of the longest waiting cell for each circuit, I think: you want a function of the expected time to drain the queue.[*] That's why I was looking at connection flush rates -- the speed at which the connection is draining, and the lengths of the other circuit queues on the connection, determine how long it will take to drain the queue.
[*] Why look at the expected drain time and not age of the longest waiting cell? Imagine a slow connection where a bunch of legit circuits have had modestly long queues for a while, and a new connection has gotten a really egregious queue really quickly. The ones that have been around longer seem like they could be the ones to get killed, which isn't so great.
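The footnote's scenario can be put in concrete (illustrative, made-up) numbers: a long-lived honest circuit with a modest queue on a slow connection versus a brand-new egregious queue on a fast one.

```python
# Hypothetical numbers for the footnote's scenario: oldest-cell age fingers
# the long-lived honest circuit, while expected drain time fingers the
# circuit that is actually clogging things. All figures are illustrative.

# (queue_len_cells, conn_bandwidth_cells_per_sec, oldest_cell_age_sec)
honest_old = (200, 50, 30.0)    # modest queue, slow conn, around a while
new_flood  = (5000, 500, 2.0)   # egregious queue built up very quickly

def drain_time(queue_len, bw, _age):
    # Expected time to flush the queue at the connection's observed rate.
    return queue_len / bw

def oldest_age(_queue_len, _bw, age):
    # Age of the longest-waiting cell.
    return age
```

By oldest-cell age the honest circuit looks worse (30 s vs 2 s) and would be killed; by expected drain time the flooding circuit looks worse (10 s vs 4 s), which is the outcome we want.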
I'm going to suggest we keep this open for now to discuss what fix we want to do in 0.2.3/0.2.4.
Trac: Milestone: Tor: 0.2.3.x-final to Tor: 0.2.4.x-final Priority: critical to major Resolution: fixed to N/A Status: closed to reopened Component: Tor to Thandy
> I think this here (algorithm 1 in 0.2.3 and 0.2.4; work on better stuff for 0.2.5 for a possible backport) is my current preferred approach, assuming there aren't flaws in it beyond what I know. Finding the exactly-right approach for 0.2.5 seems like it might take a bunch of discussion and analysis.
See #9063 (moved) for my attempt at hacking that together. It needs review.
"idea with streams limiting for #9072 (moved) is wrong too. idea with limiting half open streams is better but still fragile. no way to limit queue length with current protocol. and don't forget about many relay_truncate, relay_truncated, that can be used actively with leaky pipe topology."
I asked him: what if we made a patch that was a torrc option, so we at least had something in a release we could tell people "Turn this on if you're a guard and experience ooms?" how should we implement that option? is there a least bad choice?
"limiting streams to limit queue length to protect against killer website but protect against oom is kludge. need to think about torrc option."
> Here are a few sketch oom-killer algorithms. Let's try to find flaws in them and find something better.
> Algorithm 1.
> have a max number of allocated cells configured.
> when that number is reached, consider all circuits and kill those with the longest queues until we are at X% of the limit.
Making sure I understand this correctly: this is better than the max queue length version because the maximum is controlled by the amount of memory available globally, so the queue fill has to be much, much larger to force a circuit kill needed for the potential exploit?
> Making sure I understand this correctly: this is better than the max queue length version because the maximum is controlled by the amount of memory available globally, so the queue fill has to be much, much larger to force a circuit kill needed for the potential exploit?
Right. With all of the algorithms I outlined above, we only kill circuits once we're really about to be out of memory. So the tricks described in the original ticket here (and in the first comment) aren't sufficient to force a circuit kill.