sched: With KIST, a channel can be added more than once in the pending list
Here is the flow for this to happen (and it has been observed):
The scheduler flushes 100 bytes onto the outbuf of a channel. Then the channel it put in the re-add list (which is to add back the channel in the pending list at the end of the scheduler loop) because it can't write anymore. Its state is set to waiting_to_write
.
We then handle the next channel in the main loop but before that we'll try to write to kernel the outbuf of the previous channel which is in the re-add list.
If the outbuf is fully drained, scheduler_channel_wants_writes()
is called which will add the channel back in the pending list IF it is in the waiting_to_write
state which it is because it was set to that state just before being added to the re-add list.
Scheduler loop ends and we end up with twice the channel in the pending list. This can go on for a while resulting in the same channel being added many more times.
There are two ways to fix that, one quick and one more logical.
-
The quick one is to let the state of the channel in PENDING mode (which is what Vanilla does) before adding it to the re-add list. That way, it won't get added back in the pending list by any callbacks.
-
Originally, the assumption was that KIST takes care of the write on the socket and only KIST. But the reality is different, KIST will trigger writes to the kernel but anything after that, any errors or retry is handled by libevent
write_event
(#24448 (moved), and #24449 (moved)).
So, it doesn't make sense for KIST to reschedule the channel as pending if it is waiting to write on the socket because from that point on, it will be the job of libevent to actually flush it with its poll(POLLOUT)
feature. Thus the fix is to never add back a channel that is waiting to write.
I personally would like to have (2) for two reasons. First, we would save CPU time and useless syscalls in this fast path. Second, adding a channel that is waiting to write back into the pending list is not really good in terms of priority. It should not get considered for another round in the loop, it should simply wait until the socket has been written since the assumption in (2) is false in the first place.