https://trac.torproject.org/projects/tor/ticket/6341#comment:34 shows a lot of SOCKS timeouts from Tor clients in a testing Tor network. Apparently these clients didn't get enough directory info to establish circuits, so they just fail all their application requests. The issue is apparently exacerbated by legacy/trac#3196 (moved), where we started demanding that more descriptors be present before we consider ourselves bootstrapped.
Perhaps the real problem here is that we keep the normal dir fetch retry schedules even when TestingTorNetwork is set? It looks like TestingTorNetwork makes a new consensus every 5 minutes, but client_dl_schedule is "0, 0, 60, 60*5, 60*10, INT_MAX".
Should we lower the retry schedules?
Has it been the case this whole time that clients in testing networks typically don't have all the descriptors they'd want?
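To put numbers on that mismatch, here is the stock client schedule again, annotated with the cumulative time of each attempt (pretending, as rransom does below, that fetch attempts fail instantaneously) measured against a 5-minute testing consensus period:

#include <limits.h>

/* The stock client_dl_schedule from directory.c, annotated with the
 * cumulative time of each attempt and how many 5-minute testing consensus
 * periods that amounts to (this is an annotated copy for reading, not a
 * patch): */
static const int client_dl_schedule[] = {
  0,       /* attempt 1 at t=0                              */
  0,       /* attempt 2 at t=0                              */
  60,      /* attempt 3 at t=1m   (0.2 consensus periods)   */
  60*5,    /* attempt 4 at t=6m   (1.2 consensus periods)   */
  60*10,   /* attempt 5 at t=16m  (3.2 consensus periods)   */
  INT_MAX  /* then give up entirely                         */
};

So a client that fails its first few fetches sits idle across multiple testing consensus periods before it even tries again.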
rransom suggested "0, 0, 15, 30, 60, 75, INT_MAX" as the replacement dir fetching schedule, noting "The last fetch should be at 3 minutes with that schedule (pretending for a moment that dir fetch attempts succeed or fail instantaneously)."
I wonder if we'd be happier with a hack that, rather than specifying alternate schedules, just divides all the interval values by 12 if TestingTorNetwork is on (since 60/5 = 12).
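If we went the divide-by-12 route, the hack might look roughly like the sketch below. This is a hypothetical helper, not actual Tor code; INT_MAX entries are left untouched so a trailing "give up" entry keeps its meaning.

#include <limits.h>

/* Hypothetical sketch of the divide-all-intervals hack (not in Tor):
 * scale a download schedule down by the ratio of the normal 60-minute
 * consensus interval to the 5-minute TestingTorNetwork one (60/5 = 12),
 * leaving INT_MAX ("give up") entries alone. */
#define TESTING_SCHEDULE_DIVISOR 12

static void
scale_schedule_for_testing(int *schedule, int schedule_len)
{
  int i;
  for (i = 0; i < schedule_len; ++i) {
    if (schedule[i] != INT_MAX)
      schedule[i] /= TESTING_SCHEDULE_DIVISOR;
  }
}

In the real code the schedules are static const arrays, so in practice this would have to operate on a copy or be folded into the schedule lookup.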
This one is another huge problem. It means that any client using TestingTorNetwork for which directory_fetches_dir_info_early(options) is false has to win its directory fetches on the first try, or it'll be ten more minutes (two consensus periods) until it tries again.
I'm fine with any of the approaches described above in 0.2.4, though I think that rransom's replacement dir fetch interval idea is probably cleaner than a divide-by-five.
Here's the diff I gave Chris, who used it successfully for the December demo:
diff --git a/src/or/directory.c b/src/or/directory.c
index 1d511b5..2ba5d54 100644
--- a/src/or/directory.c
+++ b/src/or/directory.c
@@ -3616,7 +3616,7 @@ static const int server_dl_schedule[] = {
 };
 /** Schedule for when clients should download things in general. */
 static const int client_dl_schedule[] = {
-  0, 0, 60, 60*5, 60*10, INT_MAX
+  0, 0, 5, 10, 15, 20, 30, 60
 };
 /** Schedule for when servers should download consensuses. */
 static const int server_consensus_dl_schedule[] = {
@@ -3624,7 +3624,7 @@ static const int server_consensus_dl_schedule[] = {
 };
 /** Schedule for when clients should download consensuses. */
 static const int client_consensus_dl_schedule[] = {
-  0, 0, 60, 60*5, 60*10, 60*30, 60*60, 60*60, 60*60, 60*60*3, 60*60*6, 60*60*12
+  0, 0, 5, 10, 15, 20, 30, 60
 };
 /** Schedule for when clients should download bridge descriptors. */
 static const int bridge_dl_schedule[] = {
I also added:
diff --git a/src/or/main.c b/src/or/main.c
index 446836a..e3b9345 100644
--- a/src/or/main.c
+++ b/src/or/main.c
@@ -148,7 +148,7 @@ int can_complete_circuit=0;
 /** How often do we check for router descriptors that we should download
  * when we have too little directory info? */
-#define GREEDY_DESCRIPTOR_RETRY_INTERVAL (10)
+#define GREEDY_DESCRIPTOR_RETRY_INTERVAL (5)
 /** How often do we check for router descriptors that we should download
  * when we have enough directory info? */
 #define LAZY_DESCRIPTOR_RETRY_INTERVAL (60)
diff --git a/src/or/nodelist.c b/src/or/nodelist.c
index 95345fb..3b42994 100644
--- a/src/or/nodelist.c
+++ b/src/or/nodelist.c
@@ -1345,10 +1345,10 @@ update_router_have_minimum_dir_info(void)
   /* What fraction of desired server descriptors do we need before we will
    * build circuits? */
-#define FRAC_USABLE_NEEDED .75
+#define FRAC_USABLE_NEEDED .5
   /* What fraction of desired _exit_ server descriptors do we need before we
    * will build circuits? */
-#define FRAC_EXIT_USABLE_NEEDED .5
+#define FRAC_EXIT_USABLE_NEEDED .3
   if (num_present < num_usable * FRAC_USABLE_NEEDED) {
     tor_snprintf(dir_info_status, sizeof(dir_info_status),
diff --git a/src/or/routerlist.c b/src/or/routerlist.c
index 1735837..6688591 100644
--- a/src/or/routerlist.c
+++ b/src/or/routerlist.c
@@ -3987,7 +3987,7 @@ initiate_descriptor_downloads(const routerstatus_t *source,
 #define MAX_DL_TO_DELAY 16
 /** When directory clients have only a few servers to request, they batch
  * them until they have more, or until this amount of time has passed. */
-#define MAX_CLIENT_INTERVAL_WITHOUT_REQUEST (10*60)
+#define MAX_CLIENT_INTERVAL_WITHOUT_REQUEST 5
 /** Given a <b>purpose</b> (FETCH_MICRODESC or FETCH_SERVERDESC) and a list of
  * router descriptor digests or microdescriptor digest256s in
You'll notice that in the download schedules I don't have any INT_MAX at the end -- the client just keeps trying, often, for every descriptor. In a closed Tor network that should be safe to do.
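For context on why dropping INT_MAX means "retry forever, at the last listed interval": the delay is looked up by failure count, and once the count runs past the end of the schedule the last entry keeps being reused. A paraphrased sketch of that lookup (an assumption about the behaviour, not the actual Tor function):

/* Paraphrased sketch (assumed behaviour, not the actual Tor code): the
 * retry delay is indexed by the number of failures so far; past the end
 * of the schedule, the last entry is reused.  With INT_MAX last, that
 * means "give up"; with 60 last, it means "retry every minute forever". */
static int
next_retry_delay(const int *schedule, int schedule_len, int n_failures)
{
  int idx = (n_failures < schedule_len) ? n_failures : schedule_len - 1;
  return schedule[idx];
}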
More generally, there seem to be two use cases for TestingTorNetwork here: are you attempting to faithfully reproduce timing/etc problems from the real Tor network, or is the goal just-run-the-damn-Tor-network-and-make-it-work?
We could do this if a patch makes it in for the small-features deadline, but I don't know if I'll have time to write one.
> More generally, there seem to be two use cases for TestingTorNetwork here: are you attempting to faithfully reproduce timing/etc problems from the real Tor network, or is the goal just-run-the-damn-Tor-network-and-make-it-work?
I use it mostly for the latter; but if people use it for the former, this needs more thought. Perhaps this option needs another name, and needs to be settable only when TestingTorNetwork==1.
I started looking into this today, and I think we should add new config options, e.g., TestingClientDownloadSchedule (accepting a CSV list), TestingClientConsensusDownloadSchedule (accepting a CSV list), and TestingClientMaxIntervalWithoutRequest (accepting an INTERVAL) that can only be changed if TestingTorNetwork is set. I hope to get away without changing all those other constants that arma changed in the diff he gave to Chris. The fewer new torrc options we add, the better. But I think we'll have to create separate options for these things, rather than magically changing timings when TestingTorNetwork is set.
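For illustration, a testing-network torrc using the proposed options might look like the sketch below. The option names are the ones proposed above; the exact value syntax (a comma-separated list of seconds, plus an interval) is an assumption about how the patch would end up, not shipped behaviour. The schedule values are the ones from arma's diff.

TestingTorNetwork 1
# Proposed options from this ticket; names per the proposal above,
# exact syntax and units assumed:
TestingClientDownloadSchedule 0, 0, 5, 10, 15, 20, 30, 60
TestingClientConsensusDownloadSchedule 0, 0, 5, 10, 15, 20, 30, 60
TestingClientMaxIntervalWithoutRequest 5 seconds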
But before I write a patch, how would I reproduce the situation where clients don't bootstrap because of too high dir fetch retry schedules? I tried a tiny-m1.large network with Tor 0.2.3.25, but scallion.log looks normal to me. What log messages would I look for? Or how would I change the configuration to reproduce the problem?
With respect to the use case where people attempt to faithfully reproduce timing problems: we're already changing plenty of timings in TestingTorNetwork mode. If this use case exists, people should manually reset timing-related options to non-TestingTorNetwork defaults. Not directly related to this issue though.
> I started looking into this today, and I think we should add new config options, e.g., TestingClientDownloadSchedule (accepting a CSV list), TestingClientConsensusDownloadSchedule (accepting a CSV list), and TestingClientMaxIntervalWithoutRequest (accepting an INTERVAL) that can only be changed if TestingTorNetwork is set. I hope to get away without changing all those other constants that arma changed in the diff he gave to Chris. The fewer new torrc options we add, the better. But I think we'll have to create separate options for these things, rather than magically changing timings when TestingTorNetwork is set.
I think that approach sounds reasonable to me.
> But before I write a patch, how would I reproduce the situation where clients don't bootstrap because of too high dir fetch retry schedules? I tried a tiny-m1.large network with Tor 0.2.3.25, but scallion.log looks normal to me. What log messages would I look for? Or how would I change the configuration to reproduce the problem?
I haven't run into this myself; maybe Rob would know? Sometimes Chutney gets into a state where the network needs to be restarted after the authorities bootstrap. You could try that; ping me if you need help.
> But before I write a patch, how would I reproduce the situation where clients don't bootstrap because of too high dir fetch retry schedules? I tried a tiny-m1.large network with Tor 0.2.3.25, but scallion.log looks normal to me. What log messages would I look for? Or how would I change the configuration to reproduce the problem?
> I haven't run into this myself; maybe Rob would know? Sometimes Chutney gets into a state where the network needs to be restarted after the authorities bootstrap. You could try that; ping me if you need help.
I asked Chris about this problem, as I believe he has more experience with it than I. Here is his response (Note that he was not using Shadow):
We ran into that situation because of a slightly pathological case in our code. It happens frequently if descriptors get updated and pushed to the directories more frequently than the consensus period, which our code was doing. This significantly exacerbates the general problem by increasing the number of failures. I'm not sure if that's a viable test case though, since it's bad behavior (and we've changed our code to no longer do that).
The problem may be more generally reproducible simply by starting the directories at exactly the same time you start the clients. The directories won't have completed the consensus negotiation (assuming a TestingTorNetwork interval of 5 minutes) by the time the clients get into the 60*5 back-off period, so the clients will back off for 10 minutes.
For this to work, you probably need 5 authoritative directories (to make sure their negotiations take a while).
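To see Chris's timeline in numbers, here is a small standalone sketch (not Tor code) that assumes fetches fail instantly until the first usable consensus appears at some assumed time -- 6.5 minutes after a cold start in this example -- and reports when a client on the stock schedule would first succeed: the 6-minute attempt just misses, and the next try isn't until the 16-minute mark.

#include <limits.h>
#include <stdio.h>

int
main(void)
{
  /* Stock client schedule; consensus_ready_at is an assumption made
   * purely for illustration of the scenario described above. */
  const int schedule[] = { 0, 0, 60, 60*5, 60*10, INT_MAX };
  const int consensus_ready_at = 390;  /* seconds after a cold start */
  int i, t = 0;

  for (i = 0; i < (int)(sizeof(schedule)/sizeof(schedule[0])); ++i) {
    if (schedule[i] == INT_MAX) {
      printf("gave up without a consensus\n");
      return 0;
    }
    t += schedule[i];
    if (t >= consensus_ready_at) {
      printf("first successful fetch at t=%ds (attempt %d)\n", t, i+1);
      return 0;
    }
    printf("attempt %d at t=%ds: no consensus yet\n", i+1, t);
  }
  return 0;
}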
I tried a Shadow network with 5 authorities and with clients starting at the same time as authorities, but I can't reproduce this situation. I applied this patch with a crazy retry schedule and with log messages to notice when clients switched to a different retry interval:
diff --git a/src/or/directory.c b/src/or/directory.c
index f235bf3..b654a85 100644
--- a/src/or/directory.c
+++ b/src/or/directory.c
@@ -3625,7 +3625,8 @@ static const int server_dl_schedule[] = {
 };
 /** Schedule for when clients should download things in general. */
 static const int client_dl_schedule[] = {
-  0, 0, 60, 60*5, 60*10, INT_MAX
+  //0, 0, 60, 60*5, 60*10, INT_MAX
+  15, INT_MAX
 };
 /** Schedule for when servers should download consensuses. */
 static const int server_consensus_dl_schedule[] = {
@@ -3633,7 +3634,8 @@ static const int server_consensus_dl_schedule[] = {
 };
 /** Schedule for when clients should download consensuses. */
 static const int client_consensus_dl_schedule[] = {
-  0, 0, 60, 60*5, 60*10, 60*30, 60*60, 60*60, 60*60, 60*60*3, 60*60*6, 60*60*12
+  //0, 0, 60, 60*5, 60*10, 60*30, 60*60, 60*60, 60*60, 60*60*3, 60*60*6, 60*60*12
+  15, INT_MAX
 };
 /** Schedule for when clients should download bridge descriptors. */
 static const int bridge_dl_schedule[] = {
@@ -3708,14 +3710,14 @@ download_status_increment_failure(download_status_t *dls, int status_code,
   if (item) {
     if (increment == 0)
-      log_debug(LD_DIR, "%s failed %d time(s); I'll try again immediately.",
+      log_info(LD_DIR, "XXX6752 %s failed %d time(s); I'll try again immediately.",
                 item, (int)dls->n_download_failures);
     else if (dls->next_attempt_at < TIME_MAX)
-      log_debug(LD_DIR, "%s failed %d time(s); I'll try again in %d seconds.",
+      log_info(LD_DIR, "XXX6752 %s failed %d time(s); I'll try again in %d seconds.",
                 item, (int)dls->n_download_failures,
                 (int)(dls->next_attempt_at-now));
     else
-      log_debug(LD_DIR, "%s failed %d time(s); Giving up for a while.",
+      log_info(LD_DIR, "XXX6752 %s failed %d time(s); Giving up for a while.",
                 item, (int)dls->n_download_failures);
   }
   return dls->next_attempt_at;
@@ -3738,6 +3740,8 @@ download_status_reset(download_status_t *dls)
   find_dl_schedule_and_len(dls, get_options()->DirPort_set,
                            &schedule, &schedule_len);
+  if (dls->n_download_failures)
+    log_info(LD_DIR, "XXX6752 Resetting download status.");
   dls->n_download_failures = 0;
   dls->next_attempt_at = time(NULL) + schedule[0];
 }