I'm going to start by summarizing my comparison of the algorithms.
The existing algorithm is a weird hybrid of decorrelated jitter and equal jitter. Each delay is at least the size of the preceding one, plus a random value capped at a fixed multiple of the preceding delay. This means that in the worst case, d,,i+1,, = d,,i,, + d,,i,, * 3 + 1 = d,,i,, * 4 + 1, so approximately d,,i,, = 4^i^.
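To make the worst case concrete, here is a minimal Python sketch of one step of that hybrid scheme. The function name and the `max_increase` parameter are mine, not the names used in the actual C code; this is just the recurrence described above.

```python
import random

def next_delay_hybrid(prev_delay, max_increase=3):
    """One step of the existing hybrid backoff (hypothetical helper):
    the next delay is at least as long as the previous one, plus a
    random increment capped at a fixed multiple of the previous delay.
    Worst case: prev_delay * (1 + max_increase) + 1."""
    return prev_delay + random.randint(0, prev_delay * max_increase) + 1
```

With `max_increase=3`, each step at worst multiplies the delay by 4 (plus 1), matching the 4^i^ estimate above.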
For full jitter, the worst case is d,,i,, = base * m^i^.
For decorrelated jitter, the worst case is also d,,i,, = base * m^i^, but any small variation in an early delay step magnifies in later steps.
The examples in the blog show m=2 for full jitter and m=3 for decorrelated jitter. I'm not sure why they chose that variation.
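For comparison, here is a rough Python sketch of the two jittered algorithms as the blog presents them, using the blog's multipliers (m=2 for full jitter, m=3 for decorrelated jitter). The cap value is an arbitrary placeholder, not anything from our code.

```python
import random

CAP = 600  # arbitrary placeholder cap, in seconds

def full_jitter(attempt, base=1, m=2, cap=CAP):
    # Full jitter: uniform over [0, base * m**attempt], capped.
    return random.uniform(0, min(cap, base * m**attempt))

def decorrelated_jitter(prev_delay, base=1, m=3, cap=CAP):
    # Decorrelated jitter: uniform over [base, m * previous delay], capped.
    return min(cap, random.uniform(base, prev_delay * m))
```

Note that full jitter is stateless in the delay (it only needs the attempt number), while decorrelated jitter feeds each delay back into the next, which is why early variation compounds.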
The existing algorithm has some of the drawbacks of equal jitter: because it always includes a fixed delay component longer than the previous delay, it misses some possible time slots. Either decorrelated jitter or full jitter would be an improvement.
Trac: Summary: Use expontial backoff with jitter and/or tune its parameters to Use exponetial backoff with jitter and/or tune its parameters
I see that there is no longer a different multiplier between the test network case and the regular case. That's probably OK given that it looks like the mean and worst-case delay profiles become shorter after this change.
The functions would require fewer parameters and be more readable if we only implement one of the jittered backoff algorithms (probably decorrelated jitter). It would also avoid the large #ifdef blocks.
If someone could check that the math in my estimates above looks reasonable, that would be good too.
It seems that the test-network variant was added in 38e3f91c6388bc676c0007b00948, for #20597 (moved).
Based on the reasoning there, I think that it shouldn't be necessary to have that variant any more: previously, the maximum delay increase was 4x for production, 3x for test. Now, it's 3x for everyone: so it shouldn't make testnets any worse.
I've pushed an updated bug23816_029, with fixup commits to remove the "full jitter" algorithm.
Thanks! The updated patches look good to me. There are a few minor nits with spacing around operators (e.g., INT_MAX/3, base_delay+1, low_bound=0, high_bound=max_delay) that you may wish to fix before merging, but they're probably not worth another round of review.
I wrote some rough simulations in Python and they do show a significant improvement in terms of delays growing less quickly with the new algorithm.
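The actual simulation script isn't attached, but a sketch along these lines shows the same effect: mean cumulative delay grows much more slowly under decorrelated jitter than under the old hybrid. The step functions and parameters here are my own reconstruction, not the exact code that was run.

```python
import random
import statistics

def old_step(d):
    # Old hybrid: at least the previous delay, plus up to 3x more.
    return d + random.randint(0, 3 * d) + 1

def new_step(d, base=1):
    # Decorrelated jitter: uniform over [base, 3 * previous delay].
    return random.uniform(base, 3 * d)

def simulate(step, attempts=8, trials=1000):
    """Mean cumulative delay over `trials` runs of `attempts` steps,
    starting from a 1-second delay."""
    totals = []
    for _ in range(trials):
        d, total = 1, 0
        for _ in range(attempts):
            d = step(d)
            total += d
        totals.append(total)
    return statistics.mean(totals)
```

In expectation the old step multiplies the delay by about 2.5 per attempt and the decorrelated-jitter step by about 1.5, so after eight attempts the gap is large.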
Okay! I've made a new bug23816_029_squashed with the fixup commits squashed, and I've merged it to maint-0.3.2 and forward. The merge commit was e5a83062ed4e9e6b908efc6b75f13ab269f97377 -- we needed a little hacking to get the tests to pass again. Marking for possible backport.
I am leaning "no backport" on this one because the old behavior, though poor, wasn't actually breaking anything that I'm aware of.... and because the changes are larger than I'd usually like to backport without a compelling bug. Please let me know if I'm wrong there.