tests: integration-shadow test fails after rand 0.9 upgrade

mentioned in merge request !2875 (merged)

Do we know any more about why this is happening, or how it manifests?

The main suspicious log lines we're seeing are a flood of "tor_circmgr::hspool: Unable to build preemptive circuit for onion services: error: Unable to create vanguard manager: Cannot select vanguard with unbootstrapped vanguard manager" in the onion service arti instances, despite an earlier log line of "arti_client::status: 100%: connecting successfully; directory is usable, fresh until [...]".

@gabi-250 is currently debugging further

mentioned in merge request !2874 (merged)

The latest commit in !2869 (merged) where the test still passes for me locally is a71a64ff, just before the actual rand crate version bump. That is, it does appear to be the rand version bump itself (or one of the subsequent changes to fix the build) that triggers the breakage, and not one of the earlier commits that prepare for the bump.

(to be more precise: I first locally rebased to squash the squash! commits; I didn't try all of the individual commits in the original MR)

mentioned in merge request !2876 (closed)

I locally tried a couple different shadow rng seeds (passing --seed=32 to the shadow process), to see if there was latent breakage before this MR that could be triggered by otherwise perturbing the RNG. It doesn't appear so.

@ head: runs with 2 different seeds fail (though some of the otherwise failing clients succeed)
@ (revert rng update): runs with 2 different seeds pass

Another mostly-disproven hypothesis: sometimes shadow's default time model (in which CPU processing and most syscalls execute in 0 time) triggers otherwise rare bugs.

I tried running with a more realistic (but slower to simulate) time model (--model-unblocked-syscall-latency=true --max-unapplied-cpu-latency=1ns), but this didn't appear to fix anything, which mostly disproves this hypothesis (though time spent executing instructions in userspace between syscalls still takes 0 simulated time).

We found that the issue is coming from choose_multiple_weighted. It seems that in newer versions of rand, choose_multiple_weighted returns an error if there are fewer than amount elements with non-zero weight.

The source of the bug was not immediately obvious because pick_n_relays currently discards any errors coming from choose_multiple_weighted.
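
For illustration, here is a minimal std-only sketch of the failure mode described above. It is not the arti code and does not call rand: select_strict, SelectError, pick_n_discarding and pick_n_reporting are hypothetical names, and the selection is deterministic (heaviest first) rather than random so that the example needs no RNG. The only behaviour it borrows from the discussion is "return an error when fewer than the requested number of candidates have non-zero weight".

```rust
use std::fmt;

/// Hypothetical error type standing in for the selection error discussed above.
#[derive(Debug, PartialEq)]
pub enum SelectError {
    /// Fewer than `wanted` candidates had a non-zero weight.
    InsufficientNonZero { wanted: usize, available: usize },
}

impl fmt::Display for SelectError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SelectError::InsufficientNonZero { wanted, available } => write!(
                f,
                "wanted {} candidates with non-zero weight, found {}",
                wanted, available
            ),
        }
    }
}

/// Stand-in for a strict weighted selection (assumption: not rand's API).
/// Fails unless at least `amount` candidates have a non-zero weight.
/// Picks the heaviest candidates instead of sampling, to stay deterministic.
pub fn select_strict<'a>(
    candidates: &[(&'a str, u64)],
    amount: usize,
) -> Result<Vec<&'a str>, SelectError> {
    let mut usable: Vec<&(&'a str, u64)> =
        candidates.iter().filter(|c| c.1 > 0).collect();
    if usable.len() < amount {
        return Err(SelectError::InsufficientNonZero {
            wanted: amount,
            available: usable.len(),
        });
    }
    // Stable sort by descending weight; ties keep their input order.
    usable.sort_by(|a, b| b.1.cmp(&a.1));
    Ok(usable.into_iter().take(amount).map(|c| c.0).collect())
}

/// Mirrors the problematic pattern: the error is dropped, so the caller
/// only ever sees an empty list and cannot tell why.
pub fn pick_n_discarding<'a>(candidates: &[(&'a str, u64)], n: usize) -> Vec<&'a str> {
    select_strict(candidates, n).unwrap_or_default()
}

/// One possible alternative: keep the error and add context to it.
pub fn pick_n_reporting<'a>(
    candidates: &[(&'a str, u64)],
    n: usize,
) -> Result<Vec<&'a str>, String> {
    select_strict(candidates, n).map_err(|e| format!("relay selection failed: {}", e))
}
```

With three candidates of which one has weight 0, asking for three items makes pick_n_discarding return an empty list with no hint of the cause, while pick_n_reporting returns an error that names the shortfall. That difference is why a silently discarded error made this harder to track down.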

Filed upstream: https://github.com/rust-random/rand/issues/1619

mentioned in merge request !2877 (merged)

mentioned in issue #1903 (closed)

Fixed by !2877 (merged); see #1903 (closed) for follow-up.

closed
