Edge case that causes improper circuit prioritization for one scheduling run
= The Problem, Shortly
A circuit that goes from very busy for a long time, to 100% idle for a long time, and then needs to send traffic again will be incorrectly deprioritized the first time it gets scheduled.
= The Problem, Illustrated
Consider a circuit that is very very busy for a significant length of time (minutes). There's constant traffic flowing in one (or both, but let's just say one) direction on this circuit, leading to it earning for itself a high cell_count
EWMA value (thus a low priority for scheduling). Assume it is the only circuit on its channel.
Now assume it suddenly stops sending traffic but stays open. It stays this way for significant length of time (many 10s of seconds) such that its cell_count
EWMA value should be essentially zero, but it hasn't actually been updated yet since this value isn't updated until a cell has been transmitted (see circuitmux_notify_xmit_cells
).
At this point in time the relay is still servicing some number of low-traffic circuits on other channels. Maybe it has always been handling these circuits. Whatever. It doesn't matter. What matters is at this point in time there's lots of low-traffic circuits needing scheduling. Because they are low-traffic, these circuits have cell_count
EWMA values that are relatively low (thus a high priority for scheduling).
Now what happens when that original high-traffic circuit stops being totally idle? What happens when it wants to send another 1000, 100, or even just 1 cell?
It gets put into KIST channels_pending
smart list like any other circuit. In fact there are a bunch of low-bandwidth circuits in there with it. Observe what happens when KIST starts scheduling its collection of pending channels:
KIST loops over and over until its list of pending channels is empty. Each time it gets the channel with the current best-priority circuit, schedules one cell, updates the appropriate cell_count
, and puts the channel back in the pending list if necessary.
All those low-traffic circuits will be serviced first because they have low cell_count
values (high priority) as compared to the outdated cell_count
value for the original high-traffic circuit.
When the circuit finally gets to send its first cell after its long period of inactivity, its cell_count
EWMA value is corrected to be near zero. That's fine. But it should have been updated before scheduling decisions were made so that it was the first one to be scheduled.
= A solution
Add a touch
function in the circuitmux channel interface that tells the circuitmux and whatever its policy is to update its circuit priorities if desired.
Before entering the main scheduling loop, call this touch
function on all the pending channels. In the case of the EWMA policy, the touch
function would ultimately drill down to something like
ewma_touch(circuitmux_policy_data_t *pol_data)
{
ewma_policy_data_t *pol = NULL;
unsigned int tick;
double fractional_tick;
tor_assert(pol_data);
pol = TO_EWMA_POL_DATA(pol_data);
/* Rescale the EWMAs if needed */
tick = cell_ewma_get_current_tick_and_fraction(&fractional_tick);
if (tick != pol->active_circuit_pqueue_last_recalibrated) {
scale_active_circuits(pol, tick);
}
}
(Which you might observe is essentially the first part of ewma_notify_xmit_cells(...)
).