Confidential - TROVE-2022-001: Congestion control RTT injected delay
Public ticket that started this: #40624
This only affects >= tor-0.4.7.2-alpha. I have reserved TROVE-2022-001 for now and set its severity to HIGH, considering the remote nature of the bug and its consequences for the network.
What
It appears that congestion control can enter a state in which a circuit never exits CC slow start. In concrete terms, this means tor can never grow past its "initial congestion window" (set at 2 cells right now), which makes circuits extremely slow. As a client, we are talking about a couple of KB/sec.
In theory, this can be triggered in two ways, one of which can be done remotely:
- A clock jump.
- A tor withholding a SENDME for a couple of minutes would also trigger this condition.
The second one is very worrying because anyone can trigger it. A malicious client could do that to an onion service, effectively turning "off" congestion control for that service and sending a pretty huge signal to a Guard/Middle relay.
It is also likely possible that a mobile tor client could go dormant just before needing to send a SENDME, then come back online much later and send it, thus triggering this condition on the endpoint (onion service, relay).
Another possibility, which is the one my non-Exit relay ended up in, is for a directory request to stall long enough to lead to that problem. Directory authorities are often overwhelmed or heavily throttled or subject to DPI (Faravahar).
Or a malicious client could upload a descriptor to an HSDir (any relay) and withhold that SENDME, again triggering this problem and rendering the relay almost unusable.
In a nutshell, the network can come to a grinding halt if we don't fix this; failing that, we need to disable CC ASAP.
How
The problem lies in time_delta_stalled_or_jumped(), which checks whether the circuit's new RTT is far out of range compared to the previous one. In that case, it sets is_monotime_clock_broken = true, which is global to tor and therefore affects all circuits. It then returns true, so the circuit RTT is not updated because we believe the clock is no bueno.
But from that point on, every call to time_delta_stalled_or_jumped() will return true because of the guard if (old_delta == 0), where old_delta is circuit->cc->ewma_rtt_usec, which starts at 0 for a new circuit. And now, because is_monotime_clock_broken = true, it stays 0 and the check can never come back to false.
This means the circuit never gets to measure its RTT and thus never exits slow start.
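To make the latch easier to see, here is a minimal sketch of the behaviour described above. It is not the tor source: cc_sketch_t, cc_sketch_update_rtt() and the DISCREPANCY_RATIO_MAX value are hypothetical stand-ins, and only the two branches discussed in this ticket are modelled (the out-of-range check and the old_delta == 0 guard).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical threshold for "very much out of range"; the constant
 * actually used by tor may differ. */
#define DISCREPANCY_RATIO_MAX 5000

/* Process-wide flag, shared by every circuit. */
static bool is_monotime_clock_broken = false;

/* Hypothetical stand-in for the per-circuit congestion control state. */
typedef struct {
  uint64_t ewma_rtt_usec; /* 0 until a first RTT sample is accepted */
} cc_sketch_t;

/* Simplified model of the check: returns true when the sample must be
 * discarded because the clock is believed to have stalled or jumped. */
static bool
time_delta_stalled_or_jumped(uint64_t old_delta, uint64_t new_delta)
{
  /* No previous RTT on this circuit: trust the global flag. Once the
   * flag is set, a fresh circuit can never get past this guard. */
  if (old_delta == 0)
    return is_monotime_clock_broken;

  /* Sample far out of range compared to the previous one. */
  if (new_delta > old_delta * DISCREPANCY_RATIO_MAX ||
      old_delta > new_delta * DISCREPANCY_RATIO_MAX) {
    is_monotime_clock_broken = true;
    return true;
  }

  is_monotime_clock_broken = false;
  return false;
}

/* Returns true if the sample was accepted and the circuit RTT updated. */
static bool
cc_sketch_update_rtt(cc_sketch_t *cc, uint64_t rtt_usec)
{
  assert(cc != NULL);
  if (time_delta_stalled_or_jumped(cc->ewma_rtt_usec, rtt_usec))
    return false;
  cc->ewma_rtt_usec = rtt_usec; /* real code would apply an EWMA here */
  return true;
}
```

In this model, a circuit with a 20 ms RTT that then sees a 200 s sample (a SENDME withheld for a few minutes) sets the global flag. Any circuit created afterwards starts with ewma_rtt_usec == 0, so every one of its samples is rejected and its RTT stays at 0.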
Solution
Proposed patch by @mikeperry : mikeperry/tor@4bdcfdf6