      • Mike Perry
        Developer

        @eta - You probably also want to take a look at this and make sure the arti RTT implementation does not have similar issues with caching clock state. This is a pretty subtle bug, and if you are not doing BDP calcs, it may be easier to trigger in arti.

      • eta

        I don't think it currently does (??); the is_clock_stalled function only checks that the current RTT (i.e., the one just observed) is 0, not the stored EWMA value.

        However, it looks like the proposed MR fixes more than just that; it also 'defangs' the caching mechanism significantly more, making it less able to be triggered by the "observed value some large factor out from what we were expecting" case. I guess I should probably implement those fixes, too?
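
        [For illustration, a minimal Rust sketch of the check described above, with invented names rather than arti's actual API: the stall test inspects only the raw, just-observed RTT, never the stored EWMA, so a cached estimate cannot poison later checks.]

        ```rust
        use std::time::Duration;

        struct RttEstimator {
            /// Smoothed RTT estimate; None until the first valid sample.
            ewma_rtt: Option<Duration>,
        }

        impl RttEstimator {
            /// The stall check looks only at the raw observation, never
            /// at the stored EWMA, so stale cached state cannot trigger it.
            fn is_clock_stalled(observed_rtt: Duration) -> bool {
                observed_rtt.is_zero()
            }

            fn update(&mut self, observed_rtt: Duration) {
                if Self::is_clock_stalled(observed_rtt) {
                    return; // drop the sample entirely
                }
                self.ewma_rtt = Some(match self.ewma_rtt {
                    Some(prev) => (prev * 7 + observed_rtt) / 8, // EWMA, gain 1/8 (illustrative)
                    None => observed_rtt,
                });
            }
        }
        ```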

      • Mike Perry
        Developer

        Yes, definitely. This caching will need to change in arti too. I will update the test vectors and the spec today.

      • Nick Mathewson
        Owner

        I think that the solution is good (replacing the old_delta==0 check with one for whether we've had an rtt measurement before).

        I'm not sure that it makes sense to use the old_delta / new_delta ratio at all for this test, however. That's something that an individual circuit can screw up, so maybe we shouldn't use it to adjust a global flag that affects every other circuit.

        Do we know what causes "broken monotime" in the wild? Is it really a matter of monotime_absolute_usec() not behaving as advertised? If so, maybe we can detect that in our calls to monotime_absolute_usec() more directly.
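
        [A hedged sketch of the fix endorsed above, again with invented names: replace the ambiguous old_delta == 0 test, which cannot distinguish "never measured" from "clock stalled", with an explicit record of whether any RTT has been measured.]

        ```rust
        use std::time::Duration;

        struct CircuitRtt {
            // None means "never measured", which the old code conflated
            // with a genuine zero delta from a stalled clock.
            last_delta: Option<Duration>,
        }

        impl CircuitRtt {
            fn on_sample(&mut self, new_delta: Duration) {
                match self.last_delta {
                    // First measurement: nothing to compare against yet.
                    None => self.last_delta = Some(new_delta),
                    // With history, a zero delta now unambiguously means
                    // the clock stalled, so discard the sample.
                    Some(_old_delta) if new_delta.is_zero() => (),
                    Some(_old_delta) => self.last_delta = Some(new_delta),
                }
            }
        }
        ```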

      • Mike Perry
        Developer

        I'm not sure that it makes sense to use the old_delta / new_delta ratio at all for this test, however. That's something that an individual circuit can screw up, so maybe we shouldn't use it to adjust a global flag that affects every other circuit.

        Yes. I think that we could go with not caching the value in one or both of the ratio cases. I want to think on this a bit more though.

        I think that the most severe clock-adjustment case from a congestion control perspective is that a clock that moves backwards will yield lower than expected RTT, or 0 RTT. I think this stall result should still be cached and then reset globally once normality returns, to avoid this as much as possible. Underestimating minRTT is very bad and can also cause stalls, in the same way (especially if NTP spoofing can trigger this on some platforms).

        I could be talked out of caching entirely, though, if we think this case is not common either. @dgoulet and I briefly considered it.
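
        [A rough sketch of the caching behavior described here, with hypothetical names: a zero RTT, the signature of a backwards or stalled clock, sets a global stalled flag and the sample is discarded; the first plausible measurement clears the flag again.]

        ```rust
        use std::sync::atomic::{AtomicBool, Ordering};
        use std::time::Duration;

        static CLOCK_STALLED: AtomicBool = AtomicBool::new(false);

        fn note_rtt_observation(observed_rtt: Duration) -> Option<Duration> {
            if observed_rtt.is_zero() {
                // Clock went backwards or stalled: remember that globally,
                // and refuse to let this sample drag minRTT down.
                CLOCK_STALLED.store(true, Ordering::Relaxed);
                return None;
            }
            // Normality has returned; reset the global flag.
            CLOCK_STALLED.store(false, Ordering::Relaxed);
            Some(observed_rtt)
        }
        ```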

        Do we know what causes "broken monotime" in the wild? Is it really a matter of monotime_absolute_usec() not behaving as advertised? If so, maybe we can detect that in our calls to monotime_absolute_usec() more directly.

        Last time I asked wrt C-Tor, nobody remembers, and I could not could discern all of these cross-platform details, esp given shifting platform implementations of underlying clock APIs since it was written. Do you have these kinds of details on the clock that arti uses?

        Timey wimey stuff sure is tricky, some times...

      • Nick Mathewson
        Owner

        I think that the most severe clock-adjustment case from a congestion control perspective is that a clock that moves backwards will yield lower than expected RTT, or 0 RTT. I think this stall result should still be cached and then reset globally once normality returns, to avoid this as much as possible. Underestimating minRTT is very bad and can also cause stalls, in the same way (especially if NTP spoofing can trigger this on some platforms).

        Hm. Given that's the case, maybe instead of setting an "our clock is bad, don't believe any RTTs till we've seen one in bounds!" flag, we could just discard the problematic RTT observation, or err on the side of high RTTs?

        IIUC the reason to cache across circuits here is that—in between our definitely-bad RTT and our next definitely-good RTT— there might be bad RTTs that we can't detect as bad, but we want to discard them anyway?

        If so, I'm wondering what such non-detectable bad RTTs would look like and where they would come from. If our clock has jumped or stalled, then all RTTs affected by the jump or stall should be detectably bad, I think?
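
        [What the flag-free alternative might look like, as a sketch: each suspicious sample is handled locally, dropped or clamped, with no global state for one circuit to corrupt. The factor-of-100 plausibility window is an assumption for illustration, not c-tor's actual threshold.]

        ```rust
        use std::time::Duration;

        fn sanitize_rtt(observed: Duration, ewma: Option<Duration>) -> Option<Duration> {
            if observed.is_zero() {
                return None; // clock stalled or moved backwards: drop it
            }
            let Some(ewma) = ewma else {
                return Some(observed); // first sample: nothing to compare against
            };
            if observed > ewma * 100 {
                return None; // apparent forward clock jump: drop it
            }
            // "Err on the side of high RTTs": never let one sample pull
            // the estimate far below established history.
            Some(observed.max(ewma / 100))
        }
        ```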

        Last time I asked wrt C-Tor, nobody remembers, and I could not discern all of these cross-platform details, especially given shifting platform implementations of the underlying clock APIs since it was written.

        So, the code here uses regular c-tor monotime, which is either clock_gettime(CLOCK_MONOTONIC), QueryPerformanceCounter() plus a ratchet, mach_absolute_time(), or—if the operating system is extremely bogus and old—it can be gettimeofday() plus a ratchet.

        Here is what can go wrong with these time implementations in theory, as I understand it:

        1. If your system is under heavy load and your process isn't scheduled for a while, or if the process is put to sleep then reawakened, then you might see an apparent jump. That's not a bug: it's just the fact that you haven't checked the clock in a while.
        2. These implementations do not behave uniformly when the system is suspended and restarted, either via power-management or via a VM host. Some of them pause the timer while the system isn't running; with others, the timer is adjusted for the lost time, and so the timer will appear to jump forward.
        3. There have been various VM bugs over time that have caused hosted OSs to get non-monotonic "monotonic" time. The symptoms of these bugs are nigh-limitless in their variety. I have no idea how prevalent they actually are: I only know about them from reports from other implementors, as in the python and rust stdlibs. I cannot recall having seen these myself.
        4. If you're using gettimeofday() plus a ratchet, nothing good will come of it. If the user sets their clock back, or NTP does it for them, they'll stall. (No other implementation is affected by the time setting.) If the user sets their clock forward, you'll jump. I have no idea if we even support such systems: all supported windows, linux, osx, android, ios, and bsd systems have something better than gettimeofday().

        In short, 1 and 2 above are real concerns; 3 is out there someplace, but I don't know if I've ever seen it; and 4 is probably possible if you've managed to get Tor to compile for Xenix or something, but probably not worth worrying about.
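
        [For readers unfamiliar with the "plus a ratchet" construction in the list above, a rough sketch with invented names, not c-tor's actual monotime code: the wrapper clamps a raw clock that may step backwards, so callers only ever see non-decreasing time.]

        ```rust
        use std::time::Duration;

        struct RatchetClock<F: FnMut() -> Duration> {
            raw: F,           // raw clock, e.g. microseconds since some epoch
            last: Duration,   // highest value handed out so far
            offset: Duration, // correction accumulated from backward steps
        }

        impl<F: FnMut() -> Duration> RatchetClock<F> {
            fn now(&mut self) -> Duration {
                let adjusted = (self.raw)() + self.offset;
                if adjusted < self.last {
                    // Raw clock stepped backwards: grow the offset so time
                    // appears to stand still rather than run in reverse.
                    self.offset += self.last - adjusted;
                    return self.last;
                }
                self.last = adjusted;
                adjusted
            }
        }
        ```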

        Do you have these kinds of details on the clock that arti uses?

        Arti uses either std::time::Instant or coarsetime::Instant if it needs a timer, depending on performance needs. If we need a wallclock time, then it's std::time::SystemTime. The documentation for each of those is (or links to) about as much info as we have.
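
        [As a concrete example of the std::time::Instant contract, illustrative only: subtracting Instants cannot go negative, and saturating_duration_since guards even against a misordered pair.]

        ```rust
        use std::thread;
        use std::time::{Duration, Instant};

        fn main() {
            let sent_at = Instant::now(); // timestamp when the cell is sent
            thread::sleep(Duration::from_millis(50)); // stand-in for the round trip
            // Instant is monotonic by contract, so this cannot be negative,
            // though it can still stall or jump forward across a suspend
            // (cases 1 and 2 in the list above).
            let rtt = sent_at.elapsed();
            // saturating_duration_since additionally refuses to panic or
            // underflow if the operands are ever misordered (a guard
            // against the case-3 VM bugs above).
            let rtt_safe = Instant::now().saturating_duration_since(sent_at);
            println!("rtt = {rtt:?}, rtt_safe = {rtt_safe:?}");
        }
        ```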

      • David Goulet
        Owner

        More info from the health team Exit node. It got stuck on May 10th but seems to have unwedged itself from this state:

        May 10 09:26:05.607 [notice] Sudden increase in circuit RTT (1426212 vs 221), likely due to clock jump.
        May 10 09:26:25.735 [notice] Our clock has been stalled for the entire lifetime of a circuit. Performance may be sub-optimal.

        The likely theory here is that if we already have a circuit out of slow-start, it should recover, because it would already have a previous valid RTT for at least one circuit.

      • Georg Koppen
        Reporter

        David Goulet (@dgoulet) commented:

        More info from the health team Exit node. It got stuck on May 10th but seems to have unwedged itself from this state:

        May 10 09:26:05.607 [notice] Sudden increase in circuit RTT (1426212 vs 221), likely due to clock jump.
        May 10 09:26:25.735 [notice] Our clock has been stalled for the entire lifetime of a circuit. Performance may be sub-optimal.

        The likely theory here is that if we already have a circuit out of slow-start, it should recover, because it would already have a previous valid RTT for at least one circuit.

        I wonder how prevalent this issue is (both in the recoverable and non-recoverable version). Assuming it is visible on a bunch of nodes maybe this issue could be responsible for the used bw graph not rising as we expected?

      • Mike Perry
        Developer

        The fact that you must have no other active circuits out of slow start to trigger this makes me think the actual wedge case is rarer and harder to trigger than we initially expected.

        We should find some way to check for this wedge state, though. I believe that if you are wedged, you stop getting further "Sudden increase" logs, and start getting many, many "Our clock has been stalled" messages. If you are not wedged, you might get a couple of "Our clock has been stalled" messages, but not many, and they will also cease.

        As for impact on performance, once wedged, Exits will have abysmal performance. If this was common, we would be seeing a bump in the onionperf throughput CDFs at nearly zero throughput, for those Exits. We don't see that yet. For non-Exits, if they get wedged, directory downloads from them will be very slow. This could impact bootstrapping times, especially if it happens at directory mirrors or dirauths.

        My working theory is that the upgrade of iOS onionbrowser to 0.4.7 this week created the conditions to start tickling this issue, and it is not under live exploitation, based on the fact that @dgoulet is not seeing this happen again immediately after restart.

      • David Goulet
        Owner

        Another followup. I was able to trigger the condition remotely on our Exit (d2d4) by withholding the second SENDME on the circuit. It doesn't get stuck, because it has other circuits out of slow-start.

        Then I managed to trigger this on an onion service, and this time it got stuck, because there was no active circuit outside of slow-start. I had to do a bit of work to get it into the stuck state, so I don't think getting stuck is 100% automatic, but it is very close if you have no active out-of-slow-start circuits.

        This also confirms how trivial it is to trigger.

      • Mike Perry
        Developer

        This also confirms how trivial it is to trigger.

        To be more accurate:

        • It is easy to trigger on onion services with not much activity.
        • It may be possible to trigger on heavily used onion services, if you can DoS them such that other circuits die off/can't connect.
        • It is likely easy for malicious relays to trigger on clients, to slow all their activity down until a restart, but these clients would have to be uploading more than 32KB to get at least 2 sendmes back.
        • It is likely easy to trigger on non-Exits, since directory activity to them is infrequent enough for there not to be much CC activity there.
        • It is less clear if dirauths and full-fledged directory mirrors will be easy to trigger. They may have lots of concurrent activity.
        • It is very hard to trigger on fast Exits, since there is likely always a non-slow-start circuit there.
        • It might be possible to trigger on very slow Exits with no other activity at that time.
    • Mike Perry
      Developer

      Ok, here is an MR for review: !589 (closed)

      Based on @nickm's comments, and discussions with @dgoulet, I mostly got rid of the caching concept. The only place where we use the cached value now is in the RTT decrease case. It can only get set to true on a 0-delta.

      This should eliminate the possibility of wedging, since now new circuits are allowed to progress with whatever RTTs they get until they exit slow start (unless they are all 0-delta).
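
      [A sketch of the resulting behavior, under assumed names; the reset rule shown is an assumption for illustration. The cached bit can only be set by a 0-delta and is only consulted on the RTT-decrease path, so new circuits always make progress on nonzero RTTs.]

      ```rust
      use std::time::Duration;

      struct ClockHeuristics {
          stalled: bool, // the only remaining cached clock state
      }

      impl ClockHeuristics {
          fn rtt_looks_bad(&mut self, new_delta: Duration, ewma: Option<Duration>) -> bool {
              if new_delta.is_zero() {
                  self.stalled = true; // the only way the cache gets set
                  return true;
              }
              match ewma {
                  // RTT decrease: the only path that consults the cache.
                  Some(prev) if new_delta < prev => self.stalled,
                  _ => {
                      self.stalled = false; // assumed reset on a normal sample
                      false
                  }
              }
          }
      }
      ```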

    • David Goulet
      Owner

      This latest patch has my +1.

      • Nick Mathewson
        Owner

        Comments on the patch:

        • We should mention the TROVE ID (and the CVE when we have one) in the changes file.
        • Other than that, this looks plausible to me. If Mike and David can't think of a way for it to go wrong, and if it works on a relay without freaking out, I think it's okay to merge.
      • Nick Mathewson
        Owner

        So, +1.

      • David Goulet
        Owner

        I have been running this patch since yesterday on our Exit. So far so good at least.

    • Mike Perry
      Developer

      Spec update MR: torspec!75 (merged)

      @eta - New test vectors for the clock heuristics: https://gitlab.torproject.org/mikeperry/tor/-/commits/cc_test_vectors_047 (top fixup commit)

    • David Goulet
      Owner

      It is in 0.4.7.8

    • David Goulet made the issue visible to everyone
    • David Goulet closed