Nonzero bandwidthcapacity requirement for active relays hinders bootstrapping a Tor network
Context: I'm working on configuring Shadow test simulations that do not use `TestingTorNetwork`. This in turn is motivated by `TestingTorNetwork` causing undocumented and non-overridable changes in behavior, which can make it difficult to replicate production behavior inside simulations and vice versa.
A problem I'm running into is that when `TestingTorNetwork` is not set, `router_is_active` requires that the relay's self-reported `bandwidthcapacity` is non-zero. This in turn prevents such relays from getting assigned the `HSDir` flag, which in turn prevents hidden services from working. (It could also be causing other issues.)
`bandwidthcapacity` roughly corresponds to the max observed bandwidth usage of the relay in some recent window.
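To make the condition concrete, here is a minimal Python model of the check as described above. It is not the Tor source: `RelayInfo` and the `testing_tor_network` parameter are hypothetical stand-ins, the real function checks additional conditions that are not modeled, and treating `TestingTorNetwork` as disabling the capacity check entirely is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class RelayInfo:
    """Hypothetical stand-in for the descriptor fields discussed here."""
    bandwidthcapacity: int  # self-reported max observed bandwidth
    is_hibernating: bool = False


def router_is_active(ri: RelayInfo, testing_tor_network: bool) -> bool:
    """Simplified model of the activity check (assumption, not Tor's code)."""
    if ri.is_hibernating:
        return False
    # Outside a testing network, a zero self-reported capacity makes the
    # relay inactive, which in turn makes it ineligible for the HSDir flag.
    if ri.bandwidthcapacity == 0 and not testing_tor_network:
        return False
    return True


fresh_relay = RelayInfo(bandwidthcapacity=0)
print(router_is_active(fresh_relay, testing_tor_network=False))  # False
print(router_is_active(fresh_relay, testing_tor_network=True))   # True
```

In this model, a freshly started relay that has carried no traffic is inactive whenever `testing_tor_network` is false, which is the situation that blocks bootstrapping.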
Normally in Shadow simulations, we use `AssumeReachable 1`. This causes relays to skip self-testing and, since they haven't transferred any data over Tor circuits yet, upload a descriptor with `bandwidthcapacity=0`. Without `TestingTorNetwork` this causes `router_is_active` to return false for such relays (i.e. all of them). If we have some synthetic test traffic in the network (that doesn't depend on hidden services), relays that saw some test traffic eventually upload descriptors with non-zero `bandwidthcapacity`, but this can take a while and isn't very dependable.
Conversely, if we don't set `AssumeReachable 1`, then relays never upload a descriptor, presumably because there are no relays in the consensus besides the authority, preventing them from performing a successful self-test.
Some potential solutions, roughly in descending order of preference:
- Remove the `ri->bandwidthcapacity` check from `router_is_active`. It's not clear why this is needed on top of e.g. the check for `ri->is_hibernating`. This condition was added in 962765a3. That commit references #13000 (closed), but it's not clear to me that it's actually needed to address that problem (i.e. whether relays with `bandwidthcapacity=0` are considered active by the authority is orthogonal to whether relays choose to publish their descriptor before doing their bw self-test).
- Add some option for relays to report a non-zero `bandwidthcapacity` initially, for network bootstrap purposes. This seems like potentially a can of worms, and would directly reverse what #13000 (closed) was trying to accomplish.
- There might be a way to carefully orchestrate network bootstrap; e.g. start some "bootstrap relays" with `AssumeReachable 1`, then start a new set of relays without `AssumeReachable 1` that can use the first set to perform their self-test. I haven't verified that this would work, and it'd be a nontrivial bit of extra complexity and cost for doing this in Shadow simulations.
- Change the bandwidth self-test, or create some new alternate bandwidth self-test, that just uses a 1-hop circuit. I've confirmed that even if no relays use `AssumeReachable 1`, the authority is in the initial consensus, so presumably other relays could then use the authority to do their self-tests. This would make the authority a bottleneck for these tests, though that could be mitigated by staggering relay starts.
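
As a rough illustration of the staggering mentioned in the last option, the sketch below computes per-relay start times so that only a fixed number of relays begin their self-tests against the authority at once. The function name, the batch size, and the interval are all hypothetical; nothing here is an existing Shadow or Tor option, and the resulting times would still need to be fed into whatever per-process start time the simulation config supports.

```python
def staggered_start_times(num_relays, batch_size, interval_s, first_start_s=0):
    """Return one start time in seconds per relay, released in batches.

    Relays in the same batch share a start time; each later batch starts
    interval_s seconds after the previous one.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [first_start_s + (i // batch_size) * interval_s
            for i in range(num_relays)]


# Seven relays, three at a time, 30 s apart, first batch at t=60 s.
print(staggered_start_times(num_relays=7, batch_size=3, interval_s=30,
                            first_start_s=60))
# [60, 60, 60, 90, 90, 90, 120]
```

Suitable values would depend on how long a self-test takes against a single authority, which I haven't measured.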