built-in chutney networks don't use enough dirauths to reliably bootstrap
When trying to bootstrap a network that includes arti clients, I often see bootstrapping get stuck. Upon further inspection I'm not sure why, or even whether the arti clients have anything to do with it, but it seems to happen more reliably with them.
Network script:
Authority = Node(tag="a", authority=1, relay=1)
ExitRelay = Node(tag="r", relay=1, exit=1)
TorClient = Node(tag="torc", client=1, backend=NodeBackend.TOR)
ArtiClient = Node(tag="artic", client=1, backend=NodeBackend.ARTI)
NODES = Authority.getN(3) + ExitRelay.getN(5) + TorClient.getN(1) + ArtiClient.getN(1)
ConfigureNodes(NODES)
Stuck bootstrap state:
Node status:
test000a : 0, starting , Starting
test001a : 0, starting , Starting
test002a : 0, starting , Starting
test003r : 95, circuit_create , Establishing a Tor circuit
test004r : 95, circuit_create , Establishing a Tor circuit
test005r : 95, circuit_create , Establishing a Tor circuit
test006r : 95, circuit_create , Establishing a Tor circuit
test007r : 95, circuit_create , Establishing a Tor circuit
test008torc : 95, circuit_create , Establishing a Tor circuit
test009artic : 100 (SUCCESS) , None , connecting successfully; directory is usable, fresh until 2000-01-01 00:05:00 UTC, and valid until 2000-01-01 00:05:20 UTC
Published dir info:
test000a : 100 (SUCCESS) , all nodes , DESC MD MD_CONS NS_CONS
test001a : 100 (SUCCESS) , all nodes , DESC MD MD_CONS NS_CONS
test002a : 100 (SUCCESS) , all nodes , DESC MD MD_CONS NS_CONS
test003r : 0 (NO_PROGRESS) , all nodes , DESC MD MD_CONS NS_CONS
test004r : 0 (NO_PROGRESS) , all nodes , DESC MD MD_CONS NS_CONS
test005r : 0 (NO_PROGRESS) , all nodes , DESC MD MD_CONS NS_CONS
test006r : 0 (NO_PROGRESS) , all nodes , DESC MD MD_CONS NS_CONS
test007r : 0 (NO_PROGRESS) , all nodes , DESC MD MD_CONS NS_CONS
@arma says, looking at test000a's info.log.gz:
ok, looking at the authority log first
Jan 01 00:00:12.000 [notice] Time to vote.
at this point it has three descriptors
Jan 01 00:00:16.000 [notice] Time to compute a consensus.
golly that is fast :)
Jan 01 00:00:20.000 [notice] Published microdesc consensus
Jan 01 00:00:20.000 [info] new_route_len(): Not enough acceptable routers (1/3 direct and 1/3 indirect routers suitable). Discarding this circuit.
this makes sense, 000a can't make circuits because it only knows 2 other relays at that point
...
this 000a auth is pretty sad with only two other relays in the network besides itself
and looking at test003r's info.log.gz:
Jan 01 00:00:39.001 [info] handle_response_fetch_consensus(): Received consensus directory (body size 4141) from server 127.0.0.1:7102
Jan 01 00:00:41.001 [info] internal (high-uptime) circ (length 3, last hop test002a): $864432950B0DD66CCD117CAF160FA4459F02225D(open) $6F92F8DDB40456238920332273EB64E072BC1960(closed) $864432950B0DD66CCD117CAF160FA4459F02225D(closed)
that circuit is never going to work, because it has the same relay at the first and third hop
so the middle hop should refuse to extend back
i wonder why 003r is thinking that is an acceptable path
maybe because of the new guard rules, i don't know. or conflux. and maybe the rationale when that change went in was 'the chances you'll try to build a path like that are tiny so don't worry about it'
you could narrow down some of the possibilities by running the 3-auth one where the relays have conflux off
but i think it might be that internal circs use the new second-layer vanguard thing too
i think if the relay picks the wrong combination of guard and second-layer vanguards it is screwed
but that is still just a guess at this point
$ grep "Selected primary guard" info.log
see how it always picks test002a as its first hop
so yeah in conclusion i think you have found a weird edge case that happens when the network is only 3 relays, and you could investigate more or stop trying to have such a tiny network :)
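For anyone who wants to check their own relay logs for the same kind of path (the same relay at two hops, as in the test003r circuit above), here is a minimal sketch. It assumes the circuit log line keeps the `$FINGERPRINT(state)` format shown in the excerpt. `repeated_hops` is a hypothetical helper written for this issue, not part of chutney or tor.

```python
import re
from collections import Counter

# One hop as printed in the circuit log line, e.g. "$<40 hex digits>(open)".
# Assumption: the format matches the info-level log excerpt quoted in this issue.
HOP_RE = re.compile(r"\$([0-9A-Fa-f]{40})\((\w+)\)")


def repeated_hops(log_line):
    """Return the fingerprints that appear at more than one hop in a circuit log line."""
    fingerprints = [m.group(1).upper() for m in HOP_RE.finditer(log_line)]
    counts = Counter(fingerprints)
    return [fp for fp, n in counts.items() if n > 1]


# The circuit line from test003r's info log, copied from the excerpt above.
SAMPLE = (
    "Jan 01 00:00:41.001 [info] internal (high-uptime) circ "
    "(length 3, last hop test002a): "
    "$864432950B0DD66CCD117CAF160FA4459F02225D(open) "
    "$6F92F8DDB40456238920332273EB64E072BC1960(closed) "
    "$864432950B0DD66CCD117CAF160FA4459F02225D(closed)"
)

if __name__ == "__main__":
    for fp in repeated_hops(SAMPLE):
        print("relay used at more than one hop:", fp)
```

Running this over every circuit line in a relay's info.log would show whether the bad path is a one-off or whether the relay keeps choosing it.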
I'm a bit unclear on some of the details here, particularly why bootstrapping does sometimes work with 3 authorities.
Increasing the number of directory authorities from 3 to 4 seems to make the problem reliably go away.
3 directory authorities + 1 bridge authority (for which chutney creates an AlternateBridgeAuthority line but not a DirAuthority or AlternateDirAuthority line) doesn't seem to fix it. The bootstrap failure mode looks a little different, but I haven't dug into it further.
Modifying chutney to set VanguardsLiteEnabled 0 also seems to make the problem go away (but probably isn't the fix we want).