Don't set router is_running=false after intentionally closing a directory connection
In a testing tor network with a few relays, clients, and an onion service, the onion service will call run_upload_descriptor_event() periodically to upload its service descriptor. This eventually calls directory_initiate_request(), which creates a new dir connection for the upload. When run_upload_descriptor_event() runs again later, it first calls close_directory_connections() to mark for close any existing, incomplete descriptor uploads for that service. Later (on the next 1-second libevent timer) the dir connection is closed, and since the upload didn't finish, connection_dir_client_request_failed() sets the router's is_running field to false.
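To make the sequence above concrete, here is a minimal toy model in C. It is only a sketch: the toy_* types and functions are hypothetical stand-ins invented for illustration and are not tor's real structures or code paths. It shows how an upload that is closed on purpose ends up being treated the same as a failed request.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; not tor's real types. */
typedef struct {
  bool is_running;          /* mirrors the router's is_running field */
} toy_router_t;

typedef struct {
  toy_router_t *router;     /* relay the descriptor upload is sent to */
  bool finished;            /* true once the upload has completed */
  bool marked_for_close;    /* set when the connection is marked for close */
} toy_dir_conn_t;

/* Models the upload event marking an incomplete upload for close. */
void toy_close_directory_connections(toy_dir_conn_t *conn)
{
  if (conn != NULL && !conn->finished)
    conn->marked_for_close = true;
}

/* Models the later timer tick: the connection is closed, and because the
 * request never finished it is treated as a failure, so the router is
 * flagged as not running even though the close was intentional. */
void toy_close_marked_connection(toy_dir_conn_t *conn)
{
  if (conn->marked_for_close && !conn->finished)
    conn->router->is_running = false;
}
```

In this model, calling toy_close_directory_connections() followed by toy_close_marked_connection() on an unfinished upload leaves the router with is_running set to false, which is the behavior described above.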
The problem is that run_upload_descriptor_event() can run shortly after a previous run; in a Shadow simulation, it can run only 2 seconds after the previous run. If the descriptor upload has not finished within those 2 seconds, the router will be marked as not running and will not be added to the routerlist when building circuits. Since this dir client request can fail often due to tor's new circuit timeout learning, in small tor networks we quickly run out of nodes in the routerlist, and end up with:
Jan 01 00:17:50.000 [info] compute_weighted_bandwidths(): Empty routerlist passed in to consensus weight node selection for rule weight as middle node
Jan 01 00:17:50.000 [info] router_choose_random_node(): We couldn't find any live, stable routers; falling back to list of all routers.
Jan 01 00:17:50.000 [info] compute_weighted_bandwidths(): Empty routerlist passed in to consensus weight node selection for rule weight as middle node
Jan 01 00:17:50.000 [warn] No available nodes when trying to choose node. Failing.
Jan 01 00:17:50.000 [info] pick_needed_intro_points(): Unable to find a suitable node to be an introduction point for service r4aj4kaqf46mala2yykldkvwrrwjagab2qppuqtvgdxwh6spsulwu2qd.
I propose not marking a router as "not running" if tor intentionally closes a directory connection (except maybe for a TestingDirConnectionMaxStall timeout).
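One possible shape for this change is sketched below in C. It assumes the close path can tell why a connection was closed. The toy_close_reason_t enum and toy_should_mark_router_down() helper are hypothetical names made up for this sketch, not existing tor APIs, and the real fix may look quite different.

```c
#include <stdbool.h>

/* Hypothetical close reasons, for illustration only. */
typedef enum {
  TOY_CLOSE_NETWORK_FAILURE,  /* the request failed on its own */
  TOY_CLOSE_INTENTIONAL,      /* tor closed the connection on purpose */
  TOY_CLOSE_STALL_TIMEOUT     /* closed after a TestingDirConnectionMaxStall-style stall */
} toy_close_reason_t;

/* Returns true if an unfinished directory request should cause the
 * router to be marked as not running. */
bool toy_should_mark_router_down(toy_close_reason_t reason,
                                 bool request_finished)
{
  if (request_finished)
    return false;             /* a completed request is never a failure */

  switch (reason) {
    case TOY_CLOSE_INTENTIONAL:
      return false;           /* proposed change: do not blame the router */
    case TOY_CLOSE_STALL_TIMEOUT:
    case TOY_CLOSE_NETWORK_FAILURE:
      return true;            /* unchanged: still counts against the router */
  }
  return true;                /* conservative default for unknown reasons */
}
```

Whether a stall timeout should still count against the router is the open question raised above; the sketch keeps it as a failure.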