sched: When OR connection opens, always indicate the channel is ready for cells
Problem
In channel_tls_handle_state_change_on_orconn()
, when called when the OR connection becomes open, we have this snippet of code for when the new state is OPEN:
channel_change_state_open(base_chan);
if (connection_or_num_cells_writeable(conn) > 0) {
scheduler_channel_wants_writes(base_chan);
}
So basically, the above means we only indicate the scheduler that the channel wants to write if we already have cells on the outbuf
.
It basically means that if the channel scheduler state is IDLE
(initial opening state), it then ends up in state SCHED_CHAN_WAITING_FOR_CELLS
which then means the scheduler will process the channel when a cell is queued on it. But ONLY if the channel had cells when it was opened.
This on its own, as a design, is problematic because then what can make the channel transition into a state that allows the scheduler to recognize the channel as ready to be processed for cells? (SCHED_CHAN_WAITING_FOR_CELLS
).
Tor currently has 2 callsites that tells the scheduler that a channel "wants to write" data on the wire (scheduler_channel_wants_writes()
, remember that function transition the scheduler state of the channel to "waiting for cells"):
-
It is mentioned above that is when the channel becomes
OPEN
. -
The other one is when data is flushed from the outbuf, in
connection_or_flushed_some()
.
So once the channel is opened, we only become in the "wants to write" state if we previously had cells in the outbuf, which I can assure you is not always the case. And the other way is when we just wrote bytes on the network but then how can we do that in the first place?
What Is Saving Us
One question is then: Maybe tor code flow makes it that we always have a cell in the outbuf when the channel opens?
I conducted an experiment which made an Entry node of a client to only send 1 cell at a time per mainloop round. This made it that the VERSIONs
, CERTS
and NETINFO
cell were sent in 3 different mainloop round and thus the client received them with a "delay".
That was enough for the client's channel to have nothing in the outbuf when the channel became OPEN
that is when the NETINFO
cell is received from the relay which skipped the scheduler state change and thus the client channel was stuck there in scheduler IDLE
mode even though the channel was in OPEN
state.
What is saving us is because the very first thing we write on the wire, VERSIONS
cell, calls the (2) callsite that tells the scheduler to go in "waiting for cells" state. And from there, the channel stays in that state.
For now, this seems to be "fine" but any future changes like for instance where we wanted to have everything that writes on the outbuf
to go through the scheduler so we can have proper KIST prioritization, will fail due to this design.
Short-Term Solution
When the channel opens, we should simply put it in wants to write
state all the time. So even bouncing from MAINT
to OPEN
state and vice versa will never make some cells stuck in the channel until something is explicitly written on the outbuf.
Furthermore, it should be done in channel_change_state_()
since this affects channel_t
, it is NOT specific to channeltls_t
. So in a world where we end up with multiple channel type, this whole situation explodes.