
Draft: shadow ci: simulate multiple days and don't use TestingTorNetwork

Jim Newsome requested to merge jnewsome/arti:shadow-no-testing-tor-network into main

This still needs more work and cleanup, but may be useful in the meantime for further investigating HS descriptor publication issues.

shadow.yaml currently hard-codes the path to tgen on my machine; that'll need to be manually changed for others to run locally.

Things still needed before merging:

  • Fix tgen-path hard-coding issue. The problem is that shadow now runs a script instead of running tgen directly, and shadow doesn't pass the user's PATH into the script (by design, for reproducibility). I'll probably fix this by adding a wrapper script to dynamically generate shadow.yaml.
  • The simulation currently takes 20-30 minutes to run on my machine. Not bad for simulating 48h, but probably too slow for CI. I'll see if there's a way to speed this up. Otherwise this might need to be a separate test that runs e.g. nightly instead of on every push, merge, etc.
  • I'll have to rethink the transfer validation a bit. Right now there are a lot of failures for the arti HS in particular, so I'll need to set a very low required success rate. I also need to rework exactly how the validation is done, since the number of attempted transfers is now a function of how long the simulation runs (the clients run in a loop and periodically try to do some number of transfers).
  • Without TestingTorNetwork, we need to do a bit of work to convince the authority to assign the HSDir flag. dirserv_thinks_router_is_hs_dir calls router_is_active, which (when TestingTorNetwork isn't set) checks for non-zero ri->bandwidthcapacity. As far as I can tell, this comes from the relay's self-reported peak bandwidth usage, which is zero at the start of the simulation. Currently some of the relays eventually report a non-zero value (due to activity from non-HS clients) and then get the HSDir flag, but not all of them. I'm not sure of the best way to handle this.
    • Ensure some synthetic traffic is handled by every relay in a bootstrap phase of the simulation? But this will add to the already high simulation overhead.
    • Do a "one time" bootstrap with extra synthetic traffic as above, but check in the result so that "normal" simulations can skip this step? This would bring the runtime down, but adds complexity and more "stuff" checked in.
    • Synthetically create or modify the state files to convince the relays that they've observed non-zero bandwidth?
    • Add an option to relays allowing them to report some minimum non-zero capacity, instead of having to wait to empirically measure it?
    • Change the authority logic to not use self-reported-bandwidth-usage (ri->bandwidthcapacity) as a proxy for "active" (in router_is_active)? This is a little more compelling than changing the relay logic, since only the authority would need to be running a new version of tor that has this change. The current logic is also a bit non-obvious.
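
For the tgen-path item above, a minimal sketch of what a wrapper that generates shadow.yaml could look like. It resolves tgen on the invoking user's PATH outside the simulation and writes the absolute path into the config. The template file name (shadow.yaml.in) and the @TGEN_PATH@ placeholder are hypothetical; nothing in this MR defines them yet, and the real wrapper may end up looking quite different.

```python
import shutil
from pathlib import Path

# Hypothetical placeholder string; the template would contain it wherever
# the absolute tgen path is needed.
PLACEHOLDER = "@TGEN_PATH@"


def render_shadow_config(template_text: str, tgen_path: str) -> str:
    """Return the template with every placeholder replaced by tgen_path."""
    if PLACEHOLDER not in template_text:
        raise ValueError(f"template has no {PLACEHOLDER} placeholder")
    return template_text.replace(PLACEHOLDER, tgen_path)


def write_shadow_config(template: Path, output: Path) -> None:
    """Look up tgen on the caller's PATH and write the rendered config."""
    # The lookup happens here, before shadow starts, so the simulation
    # itself never depends on the user's PATH.
    tgen = shutil.which("tgen")
    if tgen is None:
        raise FileNotFoundError("tgen not found on PATH")
    tgen_abs = str(Path(tgen).resolve())
    output.write_text(render_shadow_config(template.read_text(), tgen_abs))
```

A CI job or a local run would then call something like write_shadow_config(Path("shadow.yaml.in"), Path("shadow.yaml")) before invoking shadow. Keeping the lookup outside the simulation preserves the reproducibility property: the generated file still contains a fixed path.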
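
For the transfer-validation item above, one possible shape for a check that does not depend on simulation length: compare the success rate against a threshold instead of expecting a fixed number of completed transfers. This is only a sketch. The TransferCounts type, the function name, and the thresholds are all hypothetical, and how the counts are extracted from the tgen logs is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferCounts:
    """Attempted and successful transfers for one client (hypothetical type)."""
    attempted: int
    succeeded: int


def meets_success_rate(counts: TransferCounts, min_rate: float,
                       min_attempts: int = 1) -> bool:
    """True if enough transfers were attempted and enough of them succeeded.

    min_attempts guards against a client that never ran its loop passing
    trivially; min_rate is the required fraction of successful transfers.
    """
    if not 0 <= counts.succeeded <= counts.attempted:
        raise ValueError("succeeded must be between 0 and attempted")
    if counts.attempted < min_attempts:
        return False
    # Multiply instead of dividing so that attempted == 0 needs no special case.
    return counts.succeeded >= min_rate * counts.attempted
```

With this shape, a client that attempted 200 transfers and completed 30 would pass a 10% threshold, and the same threshold keeps working if the simulated duration changes.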
Edited by Jim Newsome
