The main suspicious log lines we're seeing are a flood of "tor_circmgr::hspool: Unable to build preemptive circuit for onion services: error: Unable to create vanguard manager: Cannot select vanguard with unbootstrapped vanguard manager" in the onion service arti instances, despite an earlier log line of "arti_client::status: 100%: connecting successfully; directory is usable, fresh until [...]".
The earliest commit in !2869 (merged) where the test passes again for me locally is a71a64ff, just before the actual rand crate version bump. That is, it does appear to be the rand version bump itself (or one of the subsequent changes to fix the build) that triggers the breakage, and not one of the earlier commits that prepare for the bump.
I locally tried a couple different shadow rng seeds (passing --seed=32 to the shadow process), to see if there was latent breakage before this MR that could be triggered by otherwise perturbing the RNG. It doesn't appear so.
@ head: runs with 2 different seeds fail (though some of the otherwise failing clients succeed)
@ (revert rng update): runs with 2 different seeds pass
Another mostly-disproven hypothesis: sometimes shadow's default time model (in which CPU processing and most syscalls execute in 0 time) triggers otherwise rare bugs.
I tried running with a more realistic (but slower to simulate) time model (--model-unblocked-syscall-latency=true --max-unapplied-cpu-latency=1ns), but this didn't appear to fix anything, mostly disproving this hypothesis (though time spent executing instructions in userspace in between syscalls still takes 0 simulated time).
We found that the issue is coming from choose_multiple_weighted. It seems that in newer versions of rand, choose_multiple_weighted returns an error if there are fewer elements of non-zero weight than the requested amount.
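One way a caller could guard against this, sketched under assumptions: count the candidates with non-zero weight first and never ask for more than that. `usable_amount` is a hypothetical helper for illustration (not part of rand or arti), it uses only std, and it assumes the caller is happy to get a shorter sample rather than an error.

```rust
/// Hypothetical helper: how many items can safely be requested from a
/// weighted sample. This is at most `amount`, and at most the number of
/// candidates whose weight is strictly positive.
fn usable_amount(weights: &[f64], amount: usize) -> usize {
    // Zero-weight (and NaN) entries can never be selected, so they must not
    // count towards the number of items we ask the sampler for.
    let nonzero = weights.iter().filter(|w| **w > 0.0).count();
    amount.min(nonzero)
}
```

The clamped value would then be passed as the amount to the weighted sampling call in place of the original request, so a pool with too few usable candidates yields a short sample and not an error.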