Didn't recognize a cell, but circ stops here! Closing circuit

Adding @mikeperry as cc here, and @gk too since this affects network-health, maybe.

The question is: does it do us any good to report a circuit as "truncated" when we get a destroy from later in the circuit? Or should we just propagate the destroy?

(As a third option, we might want to say "truncated" if the destroy is for a failed extend, but not otherwise.)

The original reason for saying "truncated" was that we wanted to leave clients the option of re-extending or re-using the current circuit in the future. But no clients do that yet. Maybe under circumstances of heavy load it's better to make sure the circuit gets torn down right away?

(As a fourth option, we might want to propagate destroy back to the client only when the network is under heavy load.)

added S61-O4-Maybe - FINISHED Sponsor 61 - FINISHED labels

added Next Q3 labels

I think the best option is to immediately forward the destroy to all relays without trying to mess with TRUNCATED, so relays can avoid queuing in-flight data for a broken circuit, and instead immediately tear it down.

Trying to handle failure cases for re-extending and adding hops will be a nightmare. It will summon all the balrogs in The Maze.

Not only is it complicated/impossible to figure out the race condition between in-flight data and when the TRUNCATED happened, it will be a lot of complexity to manage trying to salvage a partially torn down circuit and re-add hops, and such complexity has been a repeated pain point in the past, even for simple non-error cases like cannibalization.

To be clear in engineering terms, this implies that we need to mark the circuit for close immediately, which will send a DESTROY down but also stop anything else from being handled on that circuit. Just sending the DESTROY, with our code, won't cause further cells on the circuit to be ignored.

And the other "problem" that it might create is that each hop will start sending TRUNCATED backward because they are all getting a DESTROY.
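A minimal sketch of the distinction being drawn here, assuming a toy circuit model: all `toy_*` names and fields are hypothetical and are not tor's actual structures. It only illustrates why "mark for close" is stronger than "send a DESTROY": the mark is what makes later cells get dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these types and names are hypothetical, not tor's. */
typedef struct toy_circuit {
  bool marked_for_close;  /* once set, the circuit is dead from our side */
  int destroy_sent;       /* number of DESTROY cells emitted */
  int cells_processed;    /* number of cells we actually acted on */
} toy_circuit_t;

/* Mark the circuit for close: emit a single DESTROY and set the flag
 * that makes every later cell on this circuit get ignored.
 * Calling it twice has no further effect. */
static void
toy_mark_for_close(toy_circuit_t *circ)
{
  assert(circ != NULL);
  if (circ->marked_for_close)
    return;
  circ->marked_for_close = true;
  circ->destroy_sent++;
}

/* Handle one incoming cell. Returns true if the cell was processed,
 * false if it was dropped because the circuit is already marked. */
static bool
toy_handle_cell(toy_circuit_t *circ)
{
  assert(circ != NULL);
  if (circ->marked_for_close)
    return false;
  circ->cells_processed++;
  return true;
}
```

In this model, sending the DESTROY without setting `marked_for_close` would leave `toy_handle_cell()` happily processing cells on a circuit that is already being torn down.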

So maybe, we should just stop sending TRUNCATED here and instead just send DESTROY one hop at a time? (which is kind of a non small protocol change?)

> So maybe, we should just stop sending TRUNCATED here and instead just send DESTROY one hop at a time? (which is kind of a non small protocol change?)

Yeah, I still think this is best. Hops that don't upgrade will still send TRUNCATED instead of tearing down, but I don't think that makes any difference, because TRUNCATED still causes the client to tear down the circuit immediately, rather than do anything else with it.

AFAICT, clients never send TRUNCATE either, though there is old code to handle it at relays.

mentioned in merge request !600 (closed)

mentioned in merge request !603 (merged)

closed with commit 8d8afc4e

Fun! I see that we're marking the circuit for close with reason TORPROTOCOL:

+      circuit_mark_for_close(circ, END_CIRC_REASON_TORPROTOCOL);

but in tor-spec.txt section 5.4 it documents that reason as

     1 -- PROTOCOL        (Tor protocol violation.)

but this case isn't a protocol violation, right? Relays are allowed to go down, causing the circuits through them to get destroyed.

I think we want to do the same thing for this backward destroy as we do for the forward destroy, i.e. pass back the reason from the destroy we just received (which we used to do in payload[0] of the truncated cell):

circuit_mark_for_close(circ, reason|END_CIRC_REASON_FLAG_REMOTE);

For a much earlier time we messed around with this logic, check out commit bfdb93d8.
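As a rough illustration of the suggestion above, here is a sketch of carrying the received reason through instead of substituting a protocol-violation reason. The `TOY_*` names are local to this sketch. The reason numbers follow the list in tor-spec.txt, and the value of the "remote" flag is an assumption chosen to sit above the one-byte reason range.

```c
#include <assert.h>
#include <stdint.h>

/* Names are hypothetical; reason numbers follow tor-spec.txt's list. */
enum {
  TOY_REASON_PROTOCOL = 1,      /* Tor protocol violation */
  TOY_REASON_CONN_CLOSED = 8,   /* connection carrying the circuit died */
  TOY_REASON_FLAG_REMOTE = 512  /* assumed flag value, outside 0..255 */
};

/* Build the local close reason from the reason byte of a received
 * DESTROY: keep the remote side's reason and tag it as remote. */
static int
toy_close_reason_from_destroy(uint8_t wire_reason)
{
  return (int)wire_reason | TOY_REASON_FLAG_REMOTE;
}

/* Recover the one-byte reason to put in the DESTROY we pass along. */
static uint8_t
toy_wire_reason(int close_reason)
{
  return (uint8_t)(close_reason & 0xff);
}

/* True if the close reason originated from the other side. */
static int
toy_reason_is_remote(int close_reason)
{
  return (close_reason & TOY_REASON_FLAG_REMOTE) != 0;
}
```

With this shape, a relay going down upstream would reach the client as the upstream reason rather than as a protocol violation, and local logging can still tell that the reason came from the remote side.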

Good catch.

I'll get on to this.

FYI: !604 (closed)

reopened

My only remaining question, beyond the suggested change above, has to do with how we changed DESTROY cells to bypass the cell queues in #7912 (closed), in a way I never fully understood.

Is the change we're doing in this ticket going to result in losing legit data heading toward the client? Specifically, can the destroy go out ahead of the already queued cells? Looking at channel_flush_from_first_active_circuit() and circuitmux_get_first_active_circuit(), it looks like yes-it-can.

For example, let's say we have a circuit A -> R1 -> R2 -> R3, and R3 goes down. R2 will now send a destroy cell toward R1, and it will arrive faster than some of the queued data cells (causing them to be discarded when they arrive). But if R1 has cells queued to send to A, then the destroy that R1 sends to A could preempt those queued cells too!

It's ugly all around. @nickm, do you remember the destroy queue design? What was our reasoning for why it was fine to potentially lose user data? Maybe it was because clients don't typically send destroy cells while they still have an outstanding request out?

The bad case would be the one where in #7912 (closed) we decided that hack is safe because of our "truncated-not-destroy" logic, and then here we're making that change without remembering that dependency.

> For example, let's say we have a circuit A -> R1 -> R2 -> R3, and R3 goes down. R2 will now send a destroy cell toward R1, and it will arrive faster than some of the queued data cells (causing them to be discarded when they arrive). But if R1 has cells queued to send to A, then the destroy that R1 sends to A could preempt those queued cells too!

This case doesn't seem that bad to me. If a DESTROY happens from a relay, the circuit is toast. The odds of even that partial data still being useful seem low.

But I think I still don't fully understand, so I have two questions:

1. Is that lost data still freed properly? Or does the hacky #7912 (closed) fix imply that it might still get stuck somewhere? So long as it is actually freed, we're ok, right?
2. If a client sends a DESTROY to tear down a circuit, might that pre-empt already sent data, and cause issues on normal non-error circuit teardown? This case worries me more, but so long as clients don't send a DESTROY on the normal teardown path until they get all their SENDME acks back, this should be ok, right?

assigned to @dgoulet

You are correct, yes.

DESTROY cells are NOT queued along the normal circuit cell queues but rather in their own queue, and we round-robin between the cmux destroy queue and the circuit queues.

And so, as you said, if data is in-flight toward A, then it will be lost due to that DESTROY. As for why, that predates my time in Tor, and it is something I have always wanted an answer to. The only comments I could find about this in the code are:

      /* XXXX We should let the cmux policy have some say in this eventually. */
      /* XXXX Alternating is not a terribly brilliant approach here. */

So, what we did with this ticket is indeed affecting what you describe. Because one direction of the circuit is busted, we alleviate the memory pressure for that direction. It is the trade-off for the lost data on the other side.

A more in-depth solution would likely be to rethink this DESTROY logic, or somehow allow in-flight cells to continue in the opposite direction of the received DESTROY cell?
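To make the preemption concrete, here is a toy model of a mux that alternates between a separate destroy queue and the circuit data queues. It is only a sketch of the alternation described above, not tor's circuitmux code: every `toy_*` name is hypothetical, and the queues are reduced to counters.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TOY_CELL_DATA, TOY_CELL_DESTROY } toy_cell_kind_t;

/* Toy mux: counters stand in for the real queues. */
typedef struct toy_mux {
  int data_queued;       /* data cells waiting on circuit queues */
  int destroy_queued;    /* DESTROY cells waiting on their own queue */
  int last_was_destroy;  /* alternation state between the two sources */
} toy_mux_t;

/* Flush up to max cells into out[], alternating between the two
 * sources whenever both have something queued. Returns the number
 * of cells written. */
static size_t
toy_mux_flush(toy_mux_t *mux, toy_cell_kind_t *out, size_t max)
{
  size_t n = 0;
  assert(mux != NULL && out != NULL);
  while (n < max && (mux->data_queued > 0 || mux->destroy_queued > 0)) {
    int pick_destroy;
    if (mux->destroy_queued == 0)
      pick_destroy = 0;
    else if (mux->data_queued == 0)
      pick_destroy = 1;
    else
      pick_destroy = !mux->last_was_destroy;

    if (pick_destroy) {
      mux->destroy_queued--;
      out[n++] = TOY_CELL_DESTROY;
    } else {
      mux->data_queued--;
      out[n++] = TOY_CELL_DATA;
    }
    mux->last_was_destroy = pick_destroy;
  }
  return n;
}
```

With three data cells already queued and one DESTROY arriving afterward, the DESTROY leaves within the first two cells flushed, so it reaches the next hop ahead of data that was queued before it. That is the R1-to-A situation from the example: whatever data trails the DESTROY is discarded on arrival.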

mentioned in merge request !604 (closed)

changed milestone to %Tor: 0.4.7.x-post-stable

closed

mentioned in issue #40655 (closed)

mentioned in commit tpo/core/debian/tor@dc13936f

mentioned in commit tpo/core/debian/tor@8d8afc4e
