Skip to content
Snippets Groups Projects
Commit 3ef7e6f1 authored by Matt Traudt's avatar Matt Traudt Committed by David Goulet
Browse files

sched: Don't get KIST stuck in an infinite loop


When a channel is scheduled and flush cells returns 0 that is no cells to
flush, we flag it back in waiting for cells so it doesn't get stuck in a
possible infinite loop.

It has been observed on moria1 where a closed channel end up in the scheduler
where the flush process returned 0 cells but it was ultimately kept in the
scheduling loop forever. We suspect that this is due to a more deeper problem
in tor where the channel_more_to_flush() is actually looking at the wrong
queue and was returning 1 for an empty channel thus putting the channel in the
"Case 4" of the scheduler which is to go back in pending state thus
re-considered at the next iteration.

This is a fix that allows the KIST scheduler to recover properly from a not
entirelly diagnosed problem in tor.

Fixes #23676

Signed-off-by: David Goulet's avatarDavid Goulet <dgoulet@torproject.org>
parent fc6c0b46
No related branches found
No related tags found
No related merge requests found
o Major bugfixes (scheduler):
If a channel is put into the scheduler's pending list, then it starts
closing, and then if the scheduler runs before it finishes closing, the
scheduler will get stuck trying to flush its cells while the lower layers
refuse to cooperate. Fix that race condition by given the scheduler an
escape method. Fixes bug 23676; bugfix on 0.3.2.1-alpha
......@@ -591,8 +591,23 @@ kist_scheduler_run(void)
if (flush_result > 0) {
update_socket_written(&socket_table, chan, flush_result *
(CELL_MAX_NETWORK_SIZE + TLS_PER_CELL_OVERHEAD));
} else {
/* XXX: This can happen because tor sometimes does flush in an
* opportunistic way cells from the circuit to the outbuf so the
* channel can end up here without having anything to flush nor needed
* to write to the kernel. Hopefully we'll fix that soon but for now
* we have to handle this case which happens kind of often. */
log_debug(LD_SCHED,
"We didn't flush anything on a chan that we think "
"can write and wants to write. The channel's state is '%s' "
"and in scheduler state %d. We're going to mark it as "
"waiting_for_cells (as that's most likely the issue) and "
"stop scheduling it this round.",
channel_state_to_string(chan->state),
chan->scheduler_state);
chan->scheduler_state = SCHED_CHAN_WAITING_FOR_CELLS;
continue;
}
/* XXX What if we didn't flush? */
}
/* Decide what to do with the channel now */
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment