undefined (#19969)

I am guessing the Orbot folks will find this bug really important.

Trac:
Keywords: N/A deleted, regression added

Trac:
Milestone: N/A to Tor: 0.2.8.x-final

This seems to happen much more often with only one bridge configured. Could it be related to legacy/trac#19989 (moved)?

I am the original bug reporter. Over the last weeks I used version 0.2.7.6 and I did not ran into the timeout in thunderbird (I guess this is 2 minutes) and so had to wait for tor opening circuits. However tor still did not open new circuits immediately after standby. Until today I never saw this as a bug, but now I think tor should open new circuits immediately as it does after a (re)boot. Today I tested version 0.2.4.27 and it behaved just like 0.2.7.6.

So in conclusion there is a regression in 0.2.8, but from my pov the bug was there for a long time.

Trac:
Username: viktorj

Ok, it just happened to me (running "Tor 0.2.9.3-alpha-dev (git-bfaded91)"). My tor client has been on an airplane all day, firewalled from the internet. Now I made a request to it, and it just sat there -- it received the request but did not attempt to make any circuits or anything.

I connected over the control port and asked it 'getinfo circuit-status' and it has no circuits open.

It is supposed to say

        log_fn(severity, LD_APP|LD_DIR,
               "Application request when we haven't %s. "
               "Optimistically trying directory fetches again.",
               !router_have_minimum_dir_info() ?
               "used client functionality lately" :
               "received a consensus with exits");

and then start making circuits and stuff.

It's weird that it didn't even try to make a circuit for the new stream request though.

Next piece of the mystery:

Oct 25 18:51:45.580 [info] routerlist_remove_old_routers(): We have 0 live routers and 0 old router descriptors.
Oct 25 21:58:43.520 [notice] Your system clock just jumped 11174 seconds forward; assuming established circuits no longer work.
Oct 25 21:58:54.825 [info] routerlist_remove_old_routers(): We have 0 live routers and 0 old router descriptors.

How come my Tor client didn't spring into action after it noticed the clock jump?

In 0.2.8, we redesigned the timer table. And we redesigned how we retrieve new consensuses when our consensus has expired. Maybe we need to tweak one or more of these things.

I think we implemented expotential backoff in 0.2.9. Not sure if that will make it better or worse.

Trac:
Milestone: Tor: 0.2.8.x-final to Tor: 0.3.0.x-final
Priority: Medium to High

Trac:
Username: viktorj
Cc: N/A to viktor_jaegerskuepper@freenet.de

Ok, I spent a while investigating this ticket yesterday. For context, the Tor client on my travel laptop is still stuck in this bug, so I can poke it in various ways to help debug if we think of useful ways. Here is a checkpoint of my thinking process.

There are two mysteries here:

Mystery 1) Why does my Tor client not care to get any new directory information?

It's October 29 now, and my tor's datadirectory contains:

-rw-------  1 arma arma 1405537 Oct 21 12:14 cached-microdesc-consensus
-rw-------  1 arma arma 3473955 Oct 21 12:43 cached-microdescs
-rw-------  1 arma arma       0 Oct 21 12:43 cached-microdescs.new

After one of the clock jump events, my Tor tried to fetch a consensus from a relay in the fallbackdir list, then it fell back to trying to fetch a consensus from each of the directory authorities, and when one attempt failed for each of those, it seems to have permanently given up -- and that was days ago. I'm going to leave this mystery for later -- somebody please feel free to pick it up in the meantime.

Mystery 2) Why does my Tor client not recognize the new application (socks) requests and jumpstart itself into trying to fetch new directory stuff, make new circuits, and so on?

My investigations led me to be suspicious of Tor commit b1d56fc58.

Background: in the past a new socks request would come in, and connection_ap_handshake_rewrite_and_attach() would get called on it, which would call connection_ap_handshake_attach_circuit(), which would call circuit_get_open_circ_or_launch(), and that function is the one that does the "omg I don't have enough directory information and I just got a socks request, I should reset everything so we can try again" logic.

But in commit b1d56fc58, we made connection_ap_handshake_rewrite_and_attach() no longer attach it, but rather it now calls connection_ap_mark_as_pending_circuit(), which effectively queues the stream for handling the next time the main loop runs. That function adds the stream to the pending_entry_connections smartlist, and sets untried_pending_connections to 1, and returns.

Then run_main_loop_once() calls connection_ap_attach_pending(0), which goes through and calls connection_ap_handshake_attach_circuit() on each stream that needs it.

But when I gdb attach to my unhappy Tor client and set a breakpoint on connection_ap_attach_pending and then do a new socks request, it seems that it never calls the function!

The earlier parts of connection_ap_handshake_rewrite_and_attach() are being called, for example rep_hist_note_used_port() gets called just fine. So I think the stream is being added to pending_entry_connections, and then somehow things aren't working after that.

Trac:
Points: N/A to 2

Replying to arma:

Ok, I spent a while investigating this ticket yesterday. For context, the Tor client on my travel laptop is still stuck in this bug, so I can poke it in various ways to help debug if we think of useful ways. Here is a checkpoint of my thinking process.

There are two mysteries here:

Mystery 1) Why does my Tor client not care to get any new directory information?

It's October 29 now, and my tor's datadirectory contains: {{{ -rw------- 1 arma arma 1405537 Oct 21 12:14 cached-microdesc-consensus -rw------- 1 arma arma 3473955 Oct 21 12:43 cached-microdescs -rw------- 1 arma arma 0 Oct 21 12:43 cached-microdescs.new }}} After one of the clock jump events, my Tor tried to fetch a consensus from a relay in the fallbackdir list, then it fell back to trying to fetch a consensus from each of the directory authorities, and when one attempt failed for each of those, it seems to have permanently given up -- and that was days ago. I'm going to leave this mystery for later -- somebody please feel free to pick it up in the meantime.

This sounds like an issue with the consensus download implementation. Here are my questions:

has your tor has marked each of the directory authorities down?
has it marked all of the fallback directories down?
has it exceeded the download schedule for downloading consensuses?

There are two kinds of download_status_t in 0.2.8, one retries on a schedule regardless of success or failure, and the other retries on failure. It seems there might be a problem with the first kind (which is new in 0.2.8, and used to download the consensus).

I'd like to know what your consensus download_status_t has for all its fields, particularly the attempt and failure counts. It may well have exceeded the 255 failure count limit.

I also wonder how legacy/trac#8625 (moved) (commit 09a0f2d0) affects this, but it is only in 0.2.9, and I'm not sure it would make it any better.

Mystery 2) Why does my Tor client not recognize the new application (socks) requests and jumpstart itself into trying to fetch new directory stuff, make new circuits, and so on?

My investigations led me to be suspicious of Tor commit b1d56fc58.

Background: in the past a new socks request would come in, and connection_ap_handshake_rewrite_and_attach() would get called on it, which would call connection_ap_handshake_attach_circuit(), which would call circuit_get_open_circ_or_launch(), and that function is the one that does the "omg I don't have enough directory information and I just got a socks request, I should reset everything so we can try again" logic.

But in commit b1d56fc58, we made connection_ap_handshake_rewrite_and_attach() no longer attach it, but rather it now calls connection_ap_mark_as_pending_circuit(), which effectively queues the stream for handling the next time the main loop runs. That function adds the stream to the pending_entry_connections smartlist, and sets untried_pending_connections to 1, and returns.

Then run_main_loop_once() calls connection_ap_attach_pending(0), which goes through and calls connection_ap_handshake_attach_circuit() on each stream that needs it.

But when I gdb attach to my unhappy Tor client and set a breakpoint on connection_ap_attach_pending and then do a new socks request, it seems that it never calls the function!

The earlier parts of connection_ap_handshake_rewrite_and_attach() are being called, for example rep_hist_note_used_port() gets called just fine. So I think the stream is being added to pending_entry_connections, and then somehow things aren't working after that.

This might be related to the above bug, or it might not be. This code would need to reset the authority and fallback failures, and also the download status for the consensus.

Replying to teor:

Replying to arma:

Mystery 1) Why does my Tor client not care to get any new directory information? This sounds like an issue with the consensus download implementation. Here are my questions:

has your tor has marked each of the directory authorities down?

has it marked all of the fallback directories down?

has it exceeded the download schedule for downloading consensuses?

There are two kinds of download_status_t in 0.2.8, one retries on a schedule regardless of success or failure, and the other retries on failure. It seems there might be a problem with the first kind (which is new in 0.2.8, and used to download the consensus).

I'd like to know what your consensus download_status_t has for all its fields, particularly the attempt and failure counts. It may well have exceeded the 255 failure count limit.

I also wonder how legacy/trac#8625 (moved) (commit 09a0f2d0) affects this, but it is only in 0.2.9, and I'm not sure it would make it any better.

To be clear this is "Tor 0.2.9.3-alpha-dev (git-bfaded91)" for me. So we are definitely in 0.2.9 territory.

But, I am going to move the mystery 1 investigation over to legacy/trac#20499 (moved), which looks like it is the same mystery over there. Then we can focus on mystery 2 in this ticket -- i.e. regardless of what our directory download situation is, Tor is supposed to wake up and reset stuff when it gets a new stream request and it doesn't have enough directory information to handle it.

Replying to arma:

My investigations led me to be suspicious of Tor commit b1d56fc58. [...] Then run_main_loop_once() calls connection_ap_attach_pending(0), which goes through and calls connection_ap_handshake_attach_circuit() on each stream that needs it.

But when I gdb attach to my unhappy Tor client and set a breakpoint on connection_ap_attach_pending and then do a new socks request, it seems that it never calls the function!

Still suspicious of commit b1d56fc58.

Is it possible that event_base_loop() in run_main_loop_once() is never finishing for me? After all, it has a bunch of callbacks, like second_elapsed_callback() and check_descriptor_callback() and so on, which all schedule themselves to get put back in the queue once they're done. Does this event_base_loop function keep running until there are no more events?

I see some ways it can choose to finish, like if flags is non-zero, or if connection_start_reading_from_linked_conn() calls tor_event_base_loopexit(), but neither of those apply to me here.

maybe set a breakpoint on the outside of event_base_loop() and see if we reach there? Though we should be reaching there if we get any incoming connections, since they count as events.

I set breakpoints all over run_main_loop_once() and they never catch. Maybe my gcc has optimized the function so it's hard to gdb on, but it looks suspiciously like event_base_loop(base, 0) just never finishes.

Where in the incoming connections events does it tell event_base_loop that it should finish?

In theory, calls to event_base_loopbreak() or event_base_loopexit() will make event_base_loop() stop. But yeah, we don't seem to do that at an appropriate time.

I'll write a patch on 0.2.8 to test this.

Okay; I think that branch bug19969_028 in my public repository may have a fix here. If we take this one, we should add more tickets in 0.3.0 to clean up this code and simplify the control flow; I tried to do this with as little collateral damage as possible.

For a simpler fix, we could add a connection_ap_attach_pending() call to second_elapsed_callback().

Trac:
Status: new to needs_review
Keywords: regression deleted, regression 029-backport 028-backport added

Awesome. It builds for me and runs in a trivial test. The code looks plausible -- though I admit that there could be bugs in the way the patch calls the libevent stuff and I wouldn't know it.

I think you're right that the patch should ultimately make its way into 0.2.8 -- this is the current stable, and folks on crummy network connections are really hurting from this regression (if I were them I would be sticking to 0.2.7 at present, which is a poor idea for other reasons).

To be thorough, if this bug is what we think it is, then there are cases where the rest of the things in run_main_loop_once() never get called. Those are:

  /* All active linked conns should get their read events activated,
   * so that libevent knows to run their callbacks. */
  SMARTLIST_FOREACH(active_linked_connection_lst, connection_t *, conn,
                    event_active(conn->read_event, EV_READ, 1));
  called_loop_once = smartlist_len(active_linked_connection_lst) ? 1 : 0;

Does this one matter? It might. But maybe because it's only about linked conns, they will probably only happen when there's other activity too? Worth thinking more about before we brush it off.

  /* Make sure we know (about) what time it is. */
  update_approx_time(time(NULL));

No worries, this one happens in second_elapsed_callback() too.

  connection_ap_attach_pending(0);

which is what your patch is about.

That's all of them, right?

Activity