I'm going to start by summarizing my comparison of the algorithms.
The existing algorithm is a weird hybrid of decorrelated jitter and equal jitter. Each delay is at least the size of the preceding one, plus a random value capped at a fixed multiple of the preceding delay. This means that in the worst case, d,,i+1,, = d,,i,, + d,,i,, * 3 + 1 = d,,i,, * 4 + 1, so approximately d,,i,, = 4^i^.
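To make the worst case concrete, here is a minimal Python sketch of one step of that hybrid scheme. The function name and the `max_increase` parameter are mine, not the names used in the actual C code; this is just the recurrence described above.

```python
import random

def next_delay_hybrid(prev_delay, max_increase=3):
    """One step of the existing hybrid backoff (hypothetical helper):
    the next delay is at least as long as the previous one, plus a
    random increment capped at a fixed multiple of the previous delay.
    Worst case: prev_delay * (1 + max_increase) + 1."""
    return prev_delay + random.randint(0, prev_delay * max_increase) + 1
```

With `max_increase=3`, each step at worst multiplies the delay by 4 (plus 1), matching the 4^i^ estimate above.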
For full jitter, the worst case is d,,i,, = base * m^i^.
For decorrelated jitter, the worst case is also d,,i,, = base * m^i^, but any small variation in an early delay step magnifies in later steps.
The examples in the blog show m=2 for full jitter and m=3 for decorrelated jitter. I'm not sure why they chose that variation.
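For comparison, here is a rough Python sketch of the two jittered algorithms as the blog presents them, using the blog's multipliers (m=2 for full jitter, m=3 for decorrelated jitter). The cap value is an arbitrary placeholder, not anything from our code.

```python
import random

CAP = 600  # arbitrary placeholder cap, in seconds

def full_jitter(attempt, base=1, m=2, cap=CAP):
    # Full jitter: uniform over [0, base * m**attempt], capped.
    return random.uniform(0, min(cap, base * m**attempt))

def decorrelated_jitter(prev_delay, base=1, m=3, cap=CAP):
    # Decorrelated jitter: uniform over [base, m * previous delay], capped.
    return min(cap, random.uniform(base, prev_delay * m))
```

Note that full jitter is stateless in the delay (it only needs the attempt number), while decorrelated jitter feeds each delay back into the next, which is why early variation compounds.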
The existing algorithm has some of the drawbacks of equal jitter: because it always includes a fixed delay component longer than the previous delay, it misses some possible time slots. Either decorrelated jitter or full jitter would be an improvement.
Trac: Summary: Use expontial backoff with jitter and/or tune its parameters to Use exponetial backoff with jitter and/or tune its parameters
I see that there is no longer a different multiplier between the test network case and the regular case. That's probably OK given that it looks like the mean and worst-case delay profiles become shorter after this change.
The functions would require fewer parameters and be more readable if we only implement one of the jittered backoff algorithms (probably decorrelated jitter). It would also avoid the large #ifdef blocks.
If someone could check that the math in my estimates above looks reasonable, that would be good too.
It seems that the test-network variant was added in 38e3f91c6388bc676c0007b00948, for #20597 (moved).
Based on the reasoning there, I think that it shouldn't be necessary to have that variant any more: previously, the maximum delay increase was 4x for production, 3x for test. Now, it's 3x for everyone: so it shouldn't make testnets any worse.
I've pushed an updated bug23816_029, with fixup commits to remove the "full jitter" algorithm.
Thanks! The updated patches look good to me. There are a few minor nits with spacing around operators (e.g., INT_MAX/3, base_delay+1, low_bound=0, high_bound=max_delay) that you may wish to fix before merging, but they're probably not worth another round of review.
I wrote some rough simulations in Python and they do show a significant improvement in terms of delays growing less quickly with the new algorithm.
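The actual simulation script isn't attached, but a sketch along these lines shows the same effect: mean cumulative delay grows much more slowly under decorrelated jitter than under the old hybrid. The step functions and parameters here are my own reconstruction, not the exact code that was run.

```python
import random
import statistics

def old_step(d):
    # Old hybrid: at least the previous delay, plus up to 3x more.
    return d + random.randint(0, 3 * d) + 1

def new_step(d, base=1):
    # Decorrelated jitter: uniform over [base, 3 * previous delay].
    return random.uniform(base, 3 * d)

def simulate(step, attempts=8, trials=1000):
    """Mean cumulative delay over `trials` runs of `attempts` steps,
    starting from a 1-second delay."""
    totals = []
    for _ in range(trials):
        d, total = 1, 0
        for _ in range(attempts):
            d = step(d)
            total += d
        totals.append(total)
    return statistics.mean(totals)
```

In expectation the old step multiplies the delay by about 2.5 per attempt and the decorrelated-jitter step by about 1.5, so after eight attempts the gap is large.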
Okay! I've made a new bug23816_029_squashed with the fixup commits squashed, and I've merged it to maint-0.3.2 and forward. The merge commit was e5a83062ed4e9e6b908efc6b75f13ab269f97377 -- we needed a little hacking to get the tests to pass again. Marking for possible backport.
I am leaning "no backport" on this one because the old behavior, though poor, wasn't actually breaking anything that I'm aware of.... and because the changes are larger than I'd usually like to backport without a compelling bug. Please let me know if I'm wrong there.