Our OOM handler only looks at the circuit queue cells and HS descriptors when freeing up memory.
We need to teach it to clean up DESTROY cells in case cleaning up the circuits is not enough.
This isn't trivial: while cleaning up circuits in the OOM handler, we also send DESTROY cells for those circuits, which allocates memory. But not sending them affects other relays, which are left hanging on dead circuits.
All in all, this is an interesting challenge, but there might be something smart to do even if it's not perfect.
The idea here is to prevent an attack that exploits a bug in tor to fill up the DESTROY cell queue, which our OOM handler currently can't do anything about.
But also not sending those will affect other relays hanging on dead circuits.
Yeah, this is an ugly one. I was first thinking about the case where a relay doesn't send back a destroy cell towards the client, so the client ends up with an out-of-sync idea of what the circuit looks like. But in that case, eventually the client might still try to close the circuit, and things will take care of themselves.
Where it gets really ugly is if the relay doesn't send a destroy forward on a circuit. Then the circuit essentially lives forever on the later relays. The next relay will only notice when the orconn that would have sent the destroy cell dies.
(If some other orconn on the dangling circuit dies, it could still trigger splintered dangling circuits: the relay on the client side of the broken orconn will send a truncated data cell towards the client, which will just be ignored since there's no circuit that it corresponds to. And then the splintered dangling circuit will live forever because nobody will ever know to tell it to go away.)
So, silently dropping destroy cells seems really bad and like we should really try to avoid it.
One option is to queue them somewhere, using the more efficient queue that we put in with legacy/trac#24666, and then send them "over the next little while". That is, it's not critical to send them immediately, so long as they are sent sometime.
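To make the "send them over the next little while" idea concrete, here is a rough sketch in Python rather than tor's C. Everything in it is hypothetical: the `PendingDestroys` class, the 5-byte record layout, and the per-tick budget are illustration only and are not how the queue from legacy/trac#24666 is actually implemented.

```python
from collections import deque
import struct

# Hypothetical compact record: 4-byte circuit ID plus 1-byte reason.
_ENTRY = struct.Struct("!IB")

# DESTROY reason code used for illustration (RESOURCELIMIT in tor-spec).
REASON_RESOURCELIMIT = 5


class PendingDestroys:
    """Compact FIFO of destroys still waiting to go out on one connection."""

    def __init__(self):
        self._entries = deque()

    def add(self, circ_id, reason):
        # Keep 5 packed bytes per pending destroy instead of a full cell.
        self._entries.append(_ENTRY.pack(circ_id, reason))

    def __len__(self):
        return len(self._entries)

    def drain(self, budget):
        """Pop up to `budget` destroys, oldest first, to be sent now."""
        out = []
        while self._entries and len(out) < budget:
            out.append(_ENTRY.unpack(self._entries.popleft()))
        return out
```

The caller would invoke `drain()` from a periodic timer with a small budget, so the destroys trickle out without a burst of allocations while the OOM handler is running.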
Another option would be to rotate the long-term ORConn once an event has happened that caused us to drop destroy requests. That is, try to work towards closing the orconn, which will trigger destruction of the remaining circuits. But if even one long-lived circuit remains, that option is not so great, since it could remain for days or even weeks.
What do we know about the pattern of destroys when we are reacting to an oom case? For example, do we end up making decisions like "close all the circuits to that relay"? In that case we could close the entire orconn, right there, rather than sending thousands of destroy cells. We'd probably want to mark it for flush for a little while so its current contents have a chance to go out, but that approach seems workable if that's the pattern of destroys that we want to make.
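If the pattern does turn out to be "everything on that connection goes", the decision could look roughly like the sketch below. This is Python for illustration, not tor code; `plan_teardown`, its arguments, and the `threshold` knob are all made up. With the default threshold of 1.0 a connection is only closed when every circuit on it is already doomed.

```python
def plan_teardown(circuits_by_conn, victims, threshold=1.0):
    """Split OOM victims into whole-connection closes and single destroys.

    circuits_by_conn: dict mapping a connection name to the set of circuit
        IDs currently open on it.
    victims: dict mapping a connection name to the set of circuit IDs the
        OOM handler wants to kill on it.
    threshold: hypothetical fraction of doomed circuits at or above which
        we close the connection instead of sending individual destroys.
    """
    close_conns = []
    destroys = {}
    for conn, doomed in victims.items():
        total = circuits_by_conn.get(conn, set())
        doomed = doomed & total  # ignore IDs that are not on this connection
        if not doomed:
            continue
        if len(doomed) >= threshold * len(total):
            close_conns.append(conn)
        else:
            destroys[conn] = sorted(doomed)
    return sorted(close_conns), destroys
```

Lowering the threshold trades a few surviving circuits for fewer destroy cells, which may or may not be acceptable. The connections in `close_conns` would still need the mark-for-flush step described above.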
Another option would be to make multidestroy cells that give you a huge pile of circids+reasons in a single cell -- basically extend the notion of the destroy queue into something that you can transport wholesale to a neighbor relay.
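A back-of-the-envelope encoding for such a multidestroy cell, assuming the usual 509-byte fixed cell payload, 4-byte circids, and 1-byte reasons. The cell itself, the 2-byte count header, and the function names are hypothetical; nothing like this exists in the protocol today.

```python
import struct

CELL_PAYLOAD_LEN = 509          # payload size of a fixed-length cell
_ENTRY = struct.Struct("!IB")   # 4-byte circuit ID, 1-byte reason
_COUNT = struct.Struct("!H")    # hypothetical 2-byte entry count header
MAX_ENTRIES = (CELL_PAYLOAD_LEN - _COUNT.size) // _ENTRY.size


def encode_multidestroy(entries):
    """Pack (circ_id, reason) pairs into as few zero-padded payloads as possible."""
    payloads = []
    for start in range(0, len(entries), MAX_ENTRIES):
        chunk = entries[start:start + MAX_ENTRIES]
        body = _COUNT.pack(len(chunk))
        body += b"".join(_ENTRY.pack(c, r) for c, r in chunk)
        payloads.append(body.ljust(CELL_PAYLOAD_LEN, b"\0"))
    return payloads


def decode_multidestroy(payload):
    """Recover the (circ_id, reason) pairs from one payload."""
    (count,) = _COUNT.unpack_from(payload, 0)
    if count > MAX_ENTRIES:
        raise ValueError("entry count exceeds payload capacity")
    return [
        _ENTRY.unpack_from(payload, _COUNT.size + i * _ENTRY.size)
        for i in range(count)
    ]
```

Under these assumptions one cell carries 101 destroys, so a thousand circuits would need ten cells rather than a thousand.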
Another option would be to make a destroy-except cell, where if you want to close a big pile of circids but leave a few open, you send over the ones not to destroy.
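Choosing between a plain destroy list and a destroy-except list is just a matter of sending whichever is shorter. A sketch, with a hypothetical `choose_encoding` helper and made-up labels for the two encodings:

```python
def choose_encoding(open_circids, doomed_circids):
    """Return whichever list is shorter to send to the neighbor relay.

    The result is ("destroy", ids_to_close) or
    ("destroy-except", ids_to_keep). Both labels are hypothetical.
    """
    open_set = set(open_circids)
    doomed = set(doomed_circids) & open_set  # only circuits that exist
    survivors = open_set - doomed
    if len(survivors) < len(doomed):
        return ("destroy-except", sorted(survivors))
    return ("destroy", sorted(doomed))
```

One thing a real design would have to settle is what happens to circuits created on the connection after the sender took its snapshot, since a destroy-except list would presumably close those too.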
While we're at it, we might want to get rid of the "send a truncate cell toward the client, and then let the client actually destroy the circuit" design. We built Tor that way so that clients could choose to have some smarter reaction in the future, like re-extending the circuit to some different next hop. But in practice we haven't figured out a smarter reaction that doesn't draw in a lot of complexity in terms of anonymity analysis, so maybe we should opt to simplify the design (and thus reduce network load).
So, silently dropping destroy cells seems really bad and like we should really try to avoid it.
While we're at it, we might want to get rid of the "send a truncate cell toward the client, and then let the client actually destroy the circuit" design.
Do we have a similar issue to the "dropping destroy cells" issue here, where we queue up some truncated cells, and then the oom killer kills them before they go out, and then we've essentially dropped those truncated cells? In theory this one isn't so bad, since the client should eventually try to close the circuit too. But it's another instance where the two sides can become unsynced because we are mixing data and control cells on the same level. And it would be (much closer to) resolved by opting to get rid of the truncated/destroy/destroyed dance.