Trac issueshttps://gitlab.torproject.org/legacy/trac/-/issues2020-06-13T15:47:10Zhttps://gitlab.torproject.org/legacy/trac/-/issues/32196cmux: Implement unit tests2020-06-13T15:47:10ZDavid Gouletdgoulet@torproject.orgcmux: Implement unit testsBottom line is that we have code coverage for the cmux/ewma code but no unit tests actually testing validity/correctness.
This ticket it to implement tests as much as we can.
#29698 is the first addition in a long time to the cmux/ewma...Bottom line is that we have code coverage for the cmux/ewma code but no unit tests actually testing validity/correctness.
This ticket it to implement tests as much as we can.
#29698 is the first addition in a long time to the cmux/ewma subsystem and we have *no* infrastructure whatsoever to test it. We can't merge it until we have proper testing. Thus, putting this ticket as a child.Tor: 0.4.3.x-finalDavid Gouletdgoulet@torproject.orgDavid Gouletdgoulet@torproject.orghttps://gitlab.torproject.org/legacy/trac/-/issues/29698Edge case that causes improper circuit prioritization for one scheduling run2020-06-13T15:47:11ZpastlyEdge case that causes improper circuit prioritization for one scheduling run= The Problem, Shortly
A circuit that goes from very busy for a long time, to 100% idle for a long time, and then needs to send traffic again will be incorrectly deprioritized the first time it gets scheduled.
= The Problem, Illustrate...= The Problem, Shortly
A circuit that goes from very busy for a long time, to 100% idle for a long time, and then needs to send traffic again will be incorrectly deprioritized the first time it gets scheduled.
= The Problem, Illustrated
Consider a circuit that is very very busy for a significant length of time (minutes). There's constant traffic flowing in one (or both, but let's just say one) direction on this circuit, leading to it earning for itself a high `cell_count` EWMA value (thus a low priority for scheduling). _Assume it is the only circuit on its channel_.
Now assume it suddenly stops sending traffic but stays open. It stays this way for significant length of time (many 10s of seconds) such that _its `cell_count` EWMA value should be essentially zero, but it hasn't actually been updated yet_ since this value isn't updated until a cell has been transmitted (see `circuitmux_notify_xmit_cells`).
At this point in time the relay is still servicing some number of low-traffic circuits _on other channels_. Maybe it has always been handling these circuits. Whatever. It doesn't matter. What matters is at this point in time there's lots of low-traffic circuits needing scheduling. Because they are low-traffic, these circuits have `cell_count` EWMA values that are relatively low (thus a high priority for scheduling).
Now what happens when that original high-traffic circuit stops being totally idle? What happens when it wants to send another 1000, 100, or even just 1 cell?
It gets put into KIST `channels_pending` smart list like any other circuit. In fact there are a bunch of low-bandwidth circuits in there with it. Observe what happens when KIST starts scheduling its collection of pending channels:
KIST loops over and over until its list of pending channels is empty. Each time it gets the channel with the current best-priority circuit, schedules one cell, _updates the appropriate `cell_count`_, and puts the channel back in the pending list if necessary.
**All those low-traffic circuits will be serviced first because they have low `cell_count` values (high priority) as compared to the outdated `cell_count` value for the original high-traffic circuit.**
When the circuit finally gets to send its first cell after its long period of inactivity, its `cell_count` EWMA value is corrected to be near zero. That's fine. _But it should have been updated before scheduling decisions were made so that it was the first one to be scheduled_.
= A solution
Add a `touch` function in the circuitmux channel interface that tells the circuitmux and whatever its policy is to update its circuit priorities if desired.
Before entering the main scheduling loop, call this `touch` function on all the pending channels. In the case of the EWMA policy, the `touch` function would ultimately drill down to something like
```
ewma_touch(circuitmux_policy_data_t *pol_data)
{
ewma_policy_data_t *pol = NULL;
unsigned int tick;
double fractional_tick;
tor_assert(pol_data);
pol = TO_EWMA_POL_DATA(pol_data);
/* Rescale the EWMAs if needed */
tick = cell_ewma_get_current_tick_and_fraction(&fractional_tick);
if (tick != pol->active_circuit_pqueue_last_recalibrated) {
scale_active_circuits(pol, tick);
}
}
```
(Which you might observe is essentially the first part of `ewma_notify_xmit_cells(...)`).Tor: unspecifiedDavid Gouletdgoulet@torproject.orgDavid Gouletdgoulet@torproject.orghttps://gitlab.torproject.org/legacy/trac/-/issues/29494Optimize interaction between circuitmux and circuitpadding2020-06-13T15:38:10ZMike PerryOptimize interaction between circuitmux and circuitpaddingIn #29204, we realized that we needed to inspect the circuit queue to prevent the circuit padding system from queuing too much padding. For that ticket, we're going to do the simple fix -- we don't schedule padding if the circuit's cell ...In #29204, we realized that we needed to inspect the circuit queue to prevent the circuit padding system from queuing too much padding. For that ticket, we're going to do the simple fix -- we don't schedule padding if the circuit's cell queue already has a window full of cells on it. In that case, we just lie to circpad and tell the machine we sent padding even though we did not.
However, we should determine if we want to have a more complicated interaction, such as delaying our padding until the queue drains. We should also examine creating alternate machine events for when cells are dequeued from circuitmux vs when they are sent locally. This has pluses and minuses for machine interaction.Tor: unspecifiedMike PerryMike Perryhttps://gitlab.torproject.org/legacy/trac/-/issues/29196circ: Remove p_mux and n_mux from circuit_t and or_circuit_t2020-06-13T15:37:14ZDavid Gouletdgoulet@torproject.orgcirc: Remove p_mux and n_mux from circuit_t and or_circuit_tThey are simply not used apart from assigning a pointer and asserting on the pointer depending on direction.
Complexity that is not needed.They are simply not used apart from assigning a pointer and asserting on the pointer depending on direction.
Complexity that is not needed.Tor: 0.4.1.x-finalDavid Gouletdgoulet@torproject.orgDavid Gouletdgoulet@torproject.orghttps://gitlab.torproject.org/legacy/trac/-/issues/25328cmux: Refactor, test and improve performance of the circuitmux subsystem2020-06-13T15:22:18ZDavid Gouletdgoulet@torproject.orgcmux: Refactor, test and improve performance of the circuitmux subsystemThe cmux subsystem (`src/or/circuitmux.c`) is part of tor's fast path. It gets called at every cell or at every few cell and every new or dying circuit.
It is currently in our top 5 of CPU hugger (see #25152) and also has a certain comp...The cmux subsystem (`src/or/circuitmux.c`) is part of tor's fast path. It gets called at every cell or at every few cell and every new or dying circuit.
It is currently in our top 5 of CPU hugger (see #25152) and also has a certain complexity to it. However, in practice, it should be quite simple. #25268 helps a lot by removing dead code and helping making the code more readable.
In order to improve that subsystem, there are few steps that need to be taken before we can address something like #25152. Furthermore, in order to fix some things in the scheduler (#23993), this work should be done so we don't add more complexity to that code.
Refactoring this subsystem to something more simple also will help testing it for which right now, it is virtually untested (see `src/test/test_circuitmux.c`).
I'm taking this ticket because I do have planned out this work already and it might change over time.Tor: unspecifiedhttps://gitlab.torproject.org/legacy/trac/-/issues/25312circ: We can pick an active circuit that is marked for close2020-06-13T15:22:15ZDavid Gouletdgoulet@torproject.orgcirc: We can pick an active circuit that is marked for closeThe issue lies in when we detach cicuits from the cmux.
The function `circuitmux_get_first_active_circuit()` returns the first active circuit from the given cmux (attached to the channel).
But, when we mark for close a circuit, we don'...The issue lies in when we detach cicuits from the cmux.
The function `circuitmux_get_first_active_circuit()` returns the first active circuit from the given cmux (attached to the channel).
But, when we mark for close a circuit, we don't detach it from the cmux, it is only done "before free" so the result is that `cmux->policy->pick_active_circuit()` can return a marked for close circuit then a cell is dequeued from it and sent on the wire.
In my experimentation, I only saw END, DROP and TRUNCATED relay commands being sent on the wire from a marked for close circuit. Thus, I don't think that is currently such a big problem but still, not that good I would say.
We should assume that from the time the circuit is marked for close, nothing should go outbound on it from that point on.
Possible solution would be to detach the circuit from the cmux when marked for close or make the active circuit function ignore closed circuit.
Not sure at this point about backport.Tor: unspecifiedhttps://gitlab.torproject.org/legacy/trac/-/issues/25268cmux: Remove round robin code and make EWMA mandatory2020-06-13T15:22:05ZDavid Gouletdgoulet@torproject.orgcmux: Remove round robin code and make EWMA mandatorySince 0.2.4.x stable, Tor has been using EWMA circuit policy which is enabled by the consensus param: `CircuitPriorityHalflifeMsec=30000`
The `circuitmux.c` code still does quite a bit to maintain the round robin circuit policy that was...Since 0.2.4.x stable, Tor has been using EWMA circuit policy which is enabled by the consensus param: `CircuitPriorityHalflifeMsec=30000`
The `circuitmux.c` code still does quite a bit to maintain the round robin circuit policy that was used prior to EWMA. It hasn't been used since 024 (except in Chutney, see #25265).
A lot of that code is called for every cell tor receives so it is part of our fast path. Removing this round robin code would help in performance and remove complexity from the code.
However, that round robin functionality is used as a fallback if the EWMA policy was either not set or couldn't pick an active circuit (I'm not sure that is possible if we have active circuits).
Thus, the side effect of removing this code is that we'll now enforce a cmux to have a policy set and make the `pick_active_circuit()` callback mandatory. That is OK because we've been enforcing that since 024.
It would also probably mean that we need to put in a default value in tor code for `CircuitPriorityHalflifeMsec` so that if the consensus doesn't specify it, we fallback to a usable value instead of being able to disable EWMA.
This piece of work will help out with improving performance of the cmux subsystem with #25152 and make sure we stick to only the required code for this fast pathTor: 0.3.4.x-finalDavid Gouletdgoulet@torproject.orgDavid Gouletdgoulet@torproject.org