https://trac.torproject.org/projects/tor/ticket/6341#comment:34 shows a lot of SOCKS timeouts from Tor clients in a testing Tor network. Apparently these clients didn't get enough directory info to establish circuits, so they just fail all their application requests. The issue is apparently exacerbated by legacy/trac#3196 (moved), where we started demanding that more descriptors be present before we consider ourselves bootstrapped.
Perhaps the real problem here is that we keep the normal dir fetch retry schedules even when TestingTorNetwork is set? It looks like TestingTorNetwork makes a new consensus every 5 minutes, but client_dl_schedule is "0, 0, 60, 60*5, 60*10, INT_MAX".
Should we lower the retry schedules?
Has it been the case this whole time that clients in testing networks typically don't have all the descriptors they'd want?
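To put numbers on that mismatch, here is the stock client schedule again, annotated with the cumulative time of each attempt (pretending, as rransom does below, that fetch attempts fail instantaneously) measured against a 5-minute testing consensus period:

#include <limits.h>

/* The stock client_dl_schedule from directory.c, annotated with the
 * cumulative time of each attempt and how many 5-minute testing consensus
 * periods that amounts to (this is an annotated copy for reading, not a
 * patch): */
static const int client_dl_schedule[] = {
  0,       /* attempt 1 at t=0                              */
  0,       /* attempt 2 at t=0                              */
  60,      /* attempt 3 at t=1m   (0.2 consensus periods)   */
  60*5,    /* attempt 4 at t=6m   (1.2 consensus periods)   */
  60*10,   /* attempt 5 at t=16m  (3.2 consensus periods)   */
  INT_MAX  /* then give up entirely                         */
};

So a client that fails its first few fetches sits idle across multiple testing consensus periods before it even tries again.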
rransom suggested "0, 0, 15, 30, 60, 75, INT_MAX" as the replacement dir fetching schedule, noting "The last fetch should be at 3 minutes with that schedule (pretending for a moment that dir fetch attempts succeed or fail instantaneously)."
I wonder if we'd be happier with a hack that, rather than specifying alternate schedules, just divides all the interval values by 12 if TestingTorNetwork is on (since 60/5 = 12).
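If we went the divide-by-12 route, the hack might look roughly like the sketch below. This is a hypothetical helper, not actual Tor code; INT_MAX entries are left untouched so a trailing "give up" entry keeps its meaning.

#include <limits.h>

/* Hypothetical sketch of the divide-all-intervals hack (not in Tor):
 * scale a download schedule down by the ratio of the normal 60-minute
 * consensus interval to the 5-minute TestingTorNetwork one (60/5 = 12),
 * leaving INT_MAX ("give up") entries alone. */
#define TESTING_SCHEDULE_DIVISOR 12

static void
scale_schedule_for_testing(int *schedule, int schedule_len)
{
  int i;
  for (i = 0; i < schedule_len; ++i) {
    if (schedule[i] != INT_MAX)
      schedule[i] /= TESTING_SCHEDULE_DIVISOR;
  }
}

In the real code the schedules are static const arrays, so in practice this would have to operate on a copy or be folded into the schedule lookup.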
This one is another huge problem. It means that any client using TestingTorNetwork for which directory_fetches_dir_info_early(options) is false has to win its directory fetches on the first try, or it'll be ten more minutes (two consensus periods) until it tries again.
I'm fine with any of the approaches described above in 0.2.4, though I think that rransom's replacement dir fetch interval idea is probably cleaner than a divide-by-five.
Here's the diff I gave Chris, who used it successfully for the December demo:
diff --git a/src/or/directory.c b/src/or/directory.c
index 1d511b5..2ba5d54 100644
--- a/src/or/directory.c
+++ b/src/or/directory.c
@@ -3616,7 +3616,7 @@ static const int server_dl_schedule[] = {
 };
 /** Schedule for when clients should download things in general. */
 static const int client_dl_schedule[] = {
-  0, 0, 60, 60*5, 60*10, INT_MAX
+  0, 0, 5, 10, 15, 20, 30, 60
 };
 /** Schedule for when servers should download consensuses. */
 static const int server_consensus_dl_schedule[] = {
@@ -3624,7 +3624,7 @@ static const int server_consensus_dl_schedule[] = {
 };
 /** Schedule for when clients should download consensuses. */
 static const int client_consensus_dl_schedule[] = {
-  0, 0, 60, 60*5, 60*10, 60*30, 60*60, 60*60, 60*60, 60*60*3, 60*60*6, 60*60*12
+  0, 0, 5, 10, 15, 20, 30, 60
 };
 /** Schedule for when clients should download bridge descriptors. */
 static const int bridge_dl_schedule[] = {
I also added:
diff --git a/src/or/main.c b/src/or/main.c
index 446836a..e3b9345 100644
--- a/src/or/main.c
+++ b/src/or/main.c
@@ -148,7 +148,7 @@ int can_complete_circuit=0;
 /** How often do we check for router descriptors that we should download
  * when we have too little directory info? */
-#define GREEDY_DESCRIPTOR_RETRY_INTERVAL (10)
+#define GREEDY_DESCRIPTOR_RETRY_INTERVAL (5)
 /** How often do we check for router descriptors that we should download
  * when we have enough directory info? */
 #define LAZY_DESCRIPTOR_RETRY_INTERVAL (60)
diff --git a/src/or/nodelist.c b/src/or/nodelist.c
index 95345fb..3b42994 100644
--- a/src/or/nodelist.c
+++ b/src/or/nodelist.c
@@ -1345,10 +1345,10 @@ update_router_have_minimum_dir_info(void)
   /* What fraction of desired server descriptors do we need before we will
    * build circuits? */
-#define FRAC_USABLE_NEEDED .75
+#define FRAC_USABLE_NEEDED .5
   /* What fraction of desired _exit_ server descriptors do we need before we
    * will build circuits? */
-#define FRAC_EXIT_USABLE_NEEDED .5
+#define FRAC_EXIT_USABLE_NEEDED .3
   if (num_present < num_usable * FRAC_USABLE_NEEDED) {
     tor_snprintf(dir_info_status, sizeof(dir_info_status),
diff --git a/src/or/routerlist.c b/src/or/routerlist.c
index 1735837..6688591 100644
--- a/src/or/routerlist.c
+++ b/src/or/routerlist.c
@@ -3987,7 +3987,7 @@ initiate_descriptor_downloads(const routerstatus_t *source,
 #define MAX_DL_TO_DELAY 16
 /** When directory clients have only a few servers to request, they batch
  * them until they have more, or until this amount of time has passed. */
-#define MAX_CLIENT_INTERVAL_WITHOUT_REQUEST (10*60)
+#define MAX_CLIENT_INTERVAL_WITHOUT_REQUEST 5
 /** Given a <b>purpose</b> (FETCH_MICRODESC or FETCH_SERVERDESC) and a list of
  * router descriptor digests or microdescriptor digest256s in
You'll notice that in the download schedules I don't have any INT_MAX at the end -- the client just keeps trying, often, for every descriptor. In a closed Tor network that should be safe to do.
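For context on why dropping INT_MAX means "retry forever, at the last listed interval": the delay is looked up by failure count, and once the count runs past the end of the schedule the last entry keeps being reused. A paraphrased sketch of that lookup (an assumption about the behaviour, not the actual Tor function):

/* Paraphrased sketch (assumed behaviour, not the actual Tor code): the
 * retry delay is indexed by the number of failures so far; past the end
 * of the schedule, the last entry is reused.  With INT_MAX last, that
 * means "give up"; with 60 last, it means "retry every minute forever". */
static int
next_retry_delay(const int *schedule, int schedule_len, int n_failures)
{
  int idx = (n_failures < schedule_len) ? n_failures : schedule_len - 1;
  return schedule[idx];
}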
More generally, there seem to be two use cases for TestingTorNetwork here: are you attempting to faithfully reproduce timing/etc problems from the real Tor network, or is the goal just-run-the-damn-Tor-network-and-make-it-work?
We could do this if a patch makes it in for the small-features deadline, but I don't know if I'll have time to write one.
> More generally, there seem to be two use cases for TestingTorNetwork here: are you attempting to faithfully reproduce timing/etc problems from the real Tor network, or is the goal just-run-the-damn-Tor-network-and-make-it-work?
I use it mostly for the latter; but if people use it for the former, this needs more thought. Perhaps this option needs another name, and needs to be settable only when TestingTorNetwork==1.
I started looking into this today, and I think we should add new config options, e.g., TestingClientDownloadSchedule (accepting a CSV list), TestingClientConsensusDownloadSchedule (accepting a CSV list), and TestingClientMaxIntervalWithoutRequest (accepting an INTERVAL) that can only be changed if TestingTorNetwork is set. I hope to get away without changing all those other constants that arma changed in the diff he gave to Chris. The fewer new torrc options we add, the better. But I think we'll have to create separate options for these things, rather than magically changing timings when TestingTorNetwork is set.
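For illustration, a testing-network torrc using the proposed options might look like the sketch below. The option names are the ones proposed above; the exact value syntax (a comma-separated list of seconds, plus an interval) is an assumption about how the patch would end up, not shipped behaviour. The schedule values are the ones from arma's diff.

TestingTorNetwork 1
# Proposed options from this ticket; names per the proposal above,
# exact syntax and units assumed:
TestingClientDownloadSchedule 0, 0, 5, 10, 15, 20, 30, 60
TestingClientConsensusDownloadSchedule 0, 0, 5, 10, 15, 20, 30, 60
TestingClientMaxIntervalWithoutRequest 5 seconds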
But before I write a patch, how would I reproduce the situation where clients don't bootstrap because of too high dir fetch retry schedules? I tried a tiny-m1.large network with Tor 0.2.3.25, but scallion.log looks normal to me. What log messages would I look for? Or how would I change the configuration to reproduce the problem?
With respect to the use case where people attempt to faithfully reproduce timing problems: we're already changing plenty of timings in TestingTorNetwork mode. If this use case exists, people should manually reset timing-related options to non-TestingTorNetwork defaults. Not directly related to this issue though.
> I started looking into this today, and I think we should add new config options, e.g., TestingClientDownloadSchedule (accepting a CSV list), TestingClientConsensusDownloadSchedule (accepting a CSV list), and TestingClientMaxIntervalWithoutRequest (accepting an INTERVAL) that can only be changed if TestingTorNetwork is set. I hope to get away without changing all those other constants that arma changed in the diff he gave to Chris. The fewer new torrc options we add, the better. But I think we'll have to create separate options for these things, rather than magically changing timings when TestingTorNetwork is set.
I think that approach sounds reasonable to me.
> But before I write a patch, how would I reproduce the situation where clients don't bootstrap because of too high dir fetch retry schedules? I tried a tiny-m1.large network with Tor 0.2.3.25, but scallion.log looks normal to me. What log messages would I look for? Or how would I change the configuration to reproduce the problem?
I haven't run into this myself; maybe Rob would know? Sometimes Chutney gets into a state where the network needs to be restarted after the authorities bootstrap. You could try that; ping me if you need help.
> But before I write a patch, how would I reproduce the situation where clients don't bootstrap because of too high dir fetch retry schedules? I tried a tiny-m1.large network with Tor 0.2.3.25, but scallion.log looks normal to me. What log messages would I look for? Or how would I change the configuration to reproduce the problem?
> I haven't run into this myself; maybe Rob would know? Sometimes Chutney gets into a state where the network needs to be restarted after the authorities bootstrap. You could try that; ping me if you need help.
I asked Chris about this problem, as I believe he has more experience with it than I. Here is his response (Note that he was not using Shadow):
We ran into that situation because of a slightly pathological case in our code. It happens frequently if descriptors get updated and pushed to the directories more frequently than the consensus period, which our code was doing. This significantly exacerbates the general problem by increasing the number of failures. I'm not sure if that's a viable test case though, since it's bad behavior (and we've changed our code to no longer do that).
The problem may be more generally reproducible simply by starting the directories at exactly the same time you start the clients. The directories won't have completed the consensus negotiation (assuming a TestingTorNetwork interval of 5 minutes) by the time the clients get into the 60*5 back-off period, so the clients will back off for 10 minutes.
For this to work, you probably need 5 authoritative directories (to make sure their negotiations take a while).
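To see Chris's timeline in numbers, here is a small standalone sketch (not Tor code) that assumes fetches fail instantly until the first usable consensus appears at some assumed time -- 6.5 minutes after a cold start in this example -- and reports when a client on the stock schedule would first succeed: the 6-minute attempt just misses, and the next try isn't until the 16-minute mark.

#include <limits.h>
#include <stdio.h>

int
main(void)
{
  /* Stock client schedule; consensus_ready_at is an assumption made
   * purely for illustration of the scenario described above. */
  const int schedule[] = { 0, 0, 60, 60*5, 60*10, INT_MAX };
  const int consensus_ready_at = 390;  /* seconds after a cold start */
  int i, t = 0;

  for (i = 0; i < (int)(sizeof(schedule)/sizeof(schedule[0])); ++i) {
    if (schedule[i] == INT_MAX) {
      printf("gave up without a consensus\n");
      return 0;
    }
    t += schedule[i];
    if (t >= consensus_ready_at) {
      printf("first successful fetch at t=%ds (attempt %d)\n", t, i+1);
      return 0;
    }
    printf("attempt %d at t=%ds: no consensus yet\n", i+1, t);
  }
  return 0;
}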
I tried a Shadow network with 5 authorities and with clients starting at the same time as authorities, but I can't reproduce this situation. I applied this patch with a crazy retry schedule and with log messages to notice when clients switched to a different retry interval:
diff --git a/src/or/directory.c b/src/or/directory.c
index f235bf3..b654a85 100644
--- a/src/or/directory.c
+++ b/src/or/directory.c
@@ -3625,7 +3625,8 @@ static const int server_dl_schedule[] = {
 };
 /** Schedule for when clients should download things in general. */
 static const int client_dl_schedule[] = {
-  0, 0, 60, 60*5, 60*10, INT_MAX
+  //0, 0, 60, 60*5, 60*10, INT_MAX
+  15, INT_MAX
 };
 /** Schedule for when servers should download consensuses. */
 static const int server_consensus_dl_schedule[] = {
@@ -3633,7 +3634,8 @@ static const int server_consensus_dl_schedule[] = {
 };
 /** Schedule for when clients should download consensuses. */
 static const int client_consensus_dl_schedule[] = {
-  0, 0, 60, 60*5, 60*10, 60*30, 60*60, 60*60, 60*60, 60*60*3, 60*60*6, 60*60*12
+  //0, 0, 60, 60*5, 60*10, 60*30, 60*60, 60*60, 60*60, 60*60*3, 60*60*6, 60*60*12
+  15, INT_MAX
 };
 /** Schedule for when clients should download bridge descriptors. */
 static const int bridge_dl_schedule[] = {
@@ -3708,14 +3710,14 @@ download_status_increment_failure(download_status_t *dls, int status_code,
   if (item) {
     if (increment == 0)
-      log_debug(LD_DIR, "%s failed %d time(s); I'll try again immediately.",
+      log_info(LD_DIR, "XXX6752 %s failed %d time(s); I'll try again immediately.",
                 item, (int)dls->n_download_failures);
     else if (dls->next_attempt_at < TIME_MAX)
-      log_debug(LD_DIR, "%s failed %d time(s); I'll try again in %d seconds.",
+      log_info(LD_DIR, "XXX6752 %s failed %d time(s); I'll try again in %d seconds.",
                 item, (int)dls->n_download_failures,
                 (int)(dls->next_attempt_at-now));
     else
-      log_debug(LD_DIR, "%s failed %d time(s); Giving up for a while.",
+      log_info(LD_DIR, "XXX6752 %s failed %d time(s); Giving up for a while.",
                 item, (int)dls->n_download_failures);
   }
   return dls->next_attempt_at;
@@ -3738,6 +3740,8 @@ download_status_reset(download_status_t *dls)
   find_dl_schedule_and_len(dls, get_options()->DirPort_set,
                            &schedule, &schedule_len);
+  if (dls->n_download_failures)
+    log_info(LD_DIR, "XXX6752 Resetting download status.");
   dls->n_download_failures = 0;
   dls->next_attempt_at = time(NULL) + schedule[0];
 }