Race condition where relays will close in-progress circuits as "finished"
Discovered in #40357: if you (as a client) create a circuit to your first hop, and then you wait 58 seconds, and then you send an extend cell to add a second hop to your circuit, then you will be racing with circuit_expire_old_circuits_serverside() to see whether the first-hop relay finishes the connection to your second hop and sends your create cell first, or it closes your circuit because it doesn't think you're using it.
That's because the circuit_expire_old_circuits_serverside() logic says "Find each non-origin circuit that has been unused for too long, has no streams on it, came from a client, and ends here: mark it for close." but:
- it defines "ends here" as "circ->n_chan is NULL", without regard for whether, say, circ->n_chan_create_cell is set, and
- it defines "unused for too long" as "channel_when_last_xmit(or_circ->p_chan) <= cutoff", and your circuit hasn't xmited anything since it sent the CREATED cell back to you when you made the first hop.
We put this circuit_expire_old_circuits_serverside() function in place to close begindir connections quicker than clients normally would, so we don't use up all our resources if there are bugs that cause thousands of clients to launch begin-dir requests at us (see commit cb31978a). So the bug here is that our checks for one-hop begindir circuits are over-broad and there's an edge case that can include multi-hop circuits too. See #29665 (closed) for another edge case we already found and fixed.
I don't think clients hit this race much in practice, because it involves clients waiting a while before starting their second hop.
I hit it in #40357 because some of my extend requests, mysteriously, sit unanswered for a minute, triggering a "DESTROY because FINISHED" response from my first-hop relay, and preventing me from finding out if I would get a successful answer if I just waited longer.