OOM needs to consider the DESTROY queued cells

changed milestone to %Tor: unspecified in legacy/trac

added labels (all in Legacy / Trac): 034-removed-20180328, 034-triage-20180328, component::core tor/tor, milestone::Tor: unspecified, oom, priority::medium, severity::normal, status::new, tor-cell, tor-circuit, type::defect

Replying to dgoulet:

But also, not sending those will affect other relays, which will be left hanging on dead circuits.

Yeah, this is an ugly one. I was first thinking about the case where a relay doesn't send back a destroy cell towards the client, so the client ends up with an out-of-sync idea of what the circuit looks like. But in that case, eventually the client might still try to close the circuit, and things will take care of themselves.

Where it gets really ugly is if the relay doesn't send a destroy forward on a circuit. Then the circuit essentially lives forever on the later relays. It will only be when the orconn that would have sent the destroy cell dies that the next relay will notice.

(If some other orconn on the dangling circuit dies, it could still trigger splintered dangling circuits: the relay on the client side of the broken orconn will send a truncated data cell towards the client, which will just be ignored since there's no circuit that it corresponds to. And then the splintered dangling circuit will live forever because nobody will ever know to tell it to go away.)

So, silently dropping destroy cells seems really bad and like we should really try to avoid it.

One option is to queue them somewhere, using the more efficient queue that we put in with legacy/trac#24666 (moved), and then send them "over the next little while". That is, it's not critical to send them immediately, so long as they are sent sometime.
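To make the "send them over the next little while" idea concrete, here is a minimal sketch in C. It is not Tor's code: the names (`pending_destroy_t`, `pending_destroy_push`, `pending_destroy_drain`), the fixed capacity, and the per-tick budget are all assumptions for illustration. Each entry holds only a circuit ID and a reason, and a periodic timer would drain a bounded number of entries per tick.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical compact entry: only the circuit ID and the close reason. */
typedef struct {
  uint32_t circid;
  uint8_t reason;
} pending_destroy_t;

#define PENDING_CAP 1024 /* assumed capacity, chosen arbitrarily */

/* Fixed-size ring buffer of destroys that still need to be sent. */
typedef struct {
  pending_destroy_t items[PENDING_CAP];
  size_t head;  /* index of the oldest entry */
  size_t count; /* number of queued entries */
} pending_destroy_queue_t;

/* Queue a destroy. Returns 0 on success, -1 if the ring is full. */
static int
pending_destroy_push(pending_destroy_queue_t *q, uint32_t circid,
                     uint8_t reason)
{
  assert(q);
  if (q->count == PENDING_CAP)
    return -1;
  size_t tail = (q->head + q->count) % PENDING_CAP;
  q->items[tail].circid = circid;
  q->items[tail].reason = reason;
  q->count++;
  return 0;
}

/* Pop up to `budget` entries, oldest first, into `out`.
 * Returns the number popped. Meant to be called once per timer tick. */
static size_t
pending_destroy_drain(pending_destroy_queue_t *q, pending_destroy_t *out,
                      size_t budget)
{
  assert(q && out);
  size_t n = 0;
  while (n < budget && q->count > 0) {
    out[n++] = q->items[q->head];
    q->head = (q->head + 1) % PENDING_CAP;
    q->count--;
  }
  return n;
}
```

With a budget of 3 and five queued destroys, the first tick would hand back three entries and the second tick the remaining two, so a burst is spread over time instead of being dropped.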

Another option would be to rotate the long-term ORConn once an event has happened that caused us to drop destroy requests. That is, try to work towards closing the orconn, which will trigger destruction of the remaining circuits. But if even one long-lived circuit remains, that option is not so great, since it could remain for days or even weeks.

What do we know about the pattern of destroys when we are reacting to an oom case? For example, do we end up making decisions like "close all the circuits to that relay"? In that case we could close the entire orconn, right there, rather than sending thousands of destroy cells. We'd probably want to mark it for flush for a little while so its current contents have a chance to go out, but that approach seems workable if that's the pattern of destroys that we want to make.
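If the pattern does turn out to be "everything on this orconn goes", the decision could look roughly like the sketch below. The tally struct and the action enum are hypothetical, and whether the OOM handler could cheaply compute such a tally is exactly the open question above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-connection tally, built while the OOM handler
 * picks which circuits to kill. */
typedef struct {
  size_t total_circuits;  /* circuits currently attached to this orconn */
  size_t doomed_circuits; /* of those, how many the OOM handler wants closed */
} orconn_tally_t;

typedef enum {
  ACTION_NONE,          /* nothing to close on this connection */
  ACTION_SEND_DESTROYS, /* close the doomed circuits one by one */
  ACTION_CLOSE_ORCONN   /* mark for flush, then close the whole connection */
} oom_action_t;

/* Choose between per-circuit destroys and closing the entire orconn. */
static oom_action_t
oom_choose_action(const orconn_tally_t *t)
{
  assert(t && t->doomed_circuits <= t->total_circuits);
  if (t->doomed_circuits == 0)
    return ACTION_NONE;
  if (t->doomed_circuits == t->total_circuits)
    return ACTION_CLOSE_ORCONN;
  return ACTION_SEND_DESTROYS;
}
```

This sketch only closes the connection when every circuit on it is doomed, so a single surviving circuit forces the per-circuit path.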

Another option would be to make multidestroy cells that give you a huge pile of circids+reasons in a single cell -- basically extend the notion of the destroy queue into something that you can transport wholesale to a neighbor relay.
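As a rough illustration of what a multidestroy payload could carry, here is a sketch with an invented wire layout: a 2-byte big-endian count followed by (4-byte circid, 1-byte reason) pairs. Only the 509-byte payload size comes from the existing fixed-length cell format; the layout, names, and functions are assumptions, and no such cell exists in the protocol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_LEN 509 /* fixed-length cell payload size */
#define ENTRY_LEN 5     /* assumed: 4-byte circid + 1-byte reason */
#define MAX_ENTRIES ((PAYLOAD_LEN - 2) / ENTRY_LEN) /* 101 per cell */

typedef struct {
  uint32_t circid;
  uint8_t reason;
} destroy_entry_t;

/* Pack up to MAX_ENTRIES entries into `payload`.
 * Returns how many were packed; the caller sends the rest in another cell. */
static size_t
multidestroy_pack(uint8_t payload[PAYLOAD_LEN], const destroy_entry_t *e,
                  size_t n)
{
  assert(payload);
  if (n > MAX_ENTRIES)
    n = MAX_ENTRIES;
  memset(payload, 0, PAYLOAD_LEN);
  payload[0] = (uint8_t)(n >> 8);
  payload[1] = (uint8_t)(n & 0xff);
  uint8_t *p = payload + 2;
  for (size_t i = 0; i < n; i++, p += ENTRY_LEN) {
    p[0] = (uint8_t)(e[i].circid >> 24);
    p[1] = (uint8_t)(e[i].circid >> 16);
    p[2] = (uint8_t)(e[i].circid >> 8);
    p[3] = (uint8_t)(e[i].circid);
    p[4] = e[i].reason;
  }
  return n;
}

/* Unpack into `out`. Returns the entry count, or -1 if the count field
 * is too large for the payload or for `out_cap`. */
static int
multidestroy_unpack(const uint8_t payload[PAYLOAD_LEN], destroy_entry_t *out,
                    size_t out_cap)
{
  assert(payload && out);
  size_t n = ((size_t)payload[0] << 8) | payload[1];
  if (n > MAX_ENTRIES || n > out_cap)
    return -1;
  const uint8_t *p = payload + 2;
  for (size_t i = 0; i < n; i++, p += ENTRY_LEN) {
    out[i].circid = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
                    ((uint32_t)p[2] << 8) | (uint32_t)p[3];
    out[i].reason = p[4];
  }
  return (int)n;
}
```

Under this layout one cell replaces up to 101 individual destroy cells, so a few thousand destroys would fit in a few dozen cells.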

Another option would be to make a destroy-except cell, where if you want to close a big pile of circids but leave a few open, you send over the ones not to destroy.
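On the receiving side, a destroy-except cell would amount to "keep these, close everything else on this connection". A minimal sketch of that filtering step, with hypothetical names and a plain array standing in for the receiver's per-connection circuit table:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Linear search; fine for a short keep list. */
static bool
in_keep_list(uint32_t circid, const uint32_t *keep, size_t n_keep)
{
  for (size_t i = 0; i < n_keep; i++) {
    if (keep[i] == circid)
      return true;
  }
  return false;
}

/* Compact `open_ids` in place so that only circids named in `keep` remain.
 * Returns the number kept; (n_open - return value) circuits were closed. */
static size_t
apply_destroy_except(uint32_t *open_ids, size_t n_open,
                     const uint32_t *keep, size_t n_keep)
{
  assert(open_ids || n_open == 0);
  size_t kept = 0;
  for (size_t i = 0; i < n_open; i++) {
    if (in_keep_list(open_ids[i], keep, n_keep))
      open_ids[kept++] = open_ids[i];
  }
  return kept;
}
```

A circid in the keep list that the receiver does not know about is simply ignored here; a real design would have to decide whether that case is an error.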

While we're at it, we might want to get rid of the "send a truncate cell toward the client, and then let the client actually destroy the circuit" design. We built Tor that way so that clients could choose to have some smarter reaction in the future, like re-extending the circuit to some different next hop. But in practice we haven't figured out a smarter reaction that doesn't draw in a lot of complexity in terms of anonymity analysis, so maybe we should opt to simplify the design (and thus reduce network load).

Replying to arma:

So, silently dropping destroy cells seems really bad and like we should really try to avoid it.

While we're at it, we might want to get rid of the "send a truncate cell toward the client, and then let the client actually destroy the circuit" design.

Do we have a similar issue to the "dropping destroy cells" issue here, where we queue up some truncated cells, and then the oom killer kills them before they go out, and then we've essentially dropped those truncated cells? In theory this one isn't so bad, since the client should eventually try to close the circuit too. But it's another instance where the two sides can become unsynced because we are mixing data and control cells on the same level. And it would be (much closer to) resolved by opting to get rid of the truncated/destroy/destroyed dance.

Moving a bunch of tickets from 033 to 034.

Trac:
Milestone: Tor: 0.3.3.x-final to Tor: 0.3.4.x-final

Trac:
Keywords: N/A deleted, 034-triage-20180328 added

Per our triage process, these tickets are pending removal from 0.3.4.

Trac:
Keywords: N/A deleted, 034-removed-20180328 added

These tickets, tagged with 034-removed-*, are no longer in-scope for 0.3.4. We can reconsider any of them, if time permits.

Trac:
Milestone: Tor: 0.3.4.x-final to Tor: unspecified

moved from legacy/trac#24667 (moved)

added Bug label and removed 1 deleted label

removed 1 deleted label

added Project Ideas label

added 1 deleted label

removed Bug label

removed milestone

added Icebox label

added DoS label

removed 1 deleted label

We now keep only the circuit ID in the "destroy cell queue", so the OOM handler shouldn't attempt to clean it. Closing.

closed
