ok, will wait until we know a) what platforms does this happen on (snow leopard only or at least 1 more), and b) whether it only happens with --enable-bufferevents.
Once that's known, the thing to find out would be how far it gets in checking its own orport. At what point does it fail, and why? Is its ORPort secretly working but not checking out as okay, or is it broken and correctly detected as broken? etc etc
This was first analyzed on Kubuntu by SwissTorExit, so not an OS X 10.6 exclusive problem.
It only happens when --enable-bufferevents is used during configure. Testing it in a private network doesn't reveal any problems (probably because AssumeReachable is set on both authorities and relays).
When trying to get a relay (with assumereachable set) into the public network, the authorities fail their side of the reachability test though, so the node never makes it.
Aug. So it seems that with bufferevents enabled, reachability testing never succeeds? The OR can't find itself to be reachable, and the authorities can't find it to be reachable either?
I was under the impression that when I tested in a private network, circuits were indeed built. Can you reproduce this on a platform where this occurs?
If that's right, this might mean that bufferevents-using tors can handshake with each other, but not with non-bufferevents tors. If that's so, I wonder how far they get before the handshake fails. If not, I wonder which part of the reachability testing goes wrong.
Trac: Status: new to accepted Summary: Reachability test for ORPort doesn't complete to Reachability test for ORPort doesn't complete with --enable-bufferevents set Owner: N/Ato nickm
If incoming are failing and outgoing are suceeding, I suspect that this is one of those openssls that needs its "yes let me renegotiate!" magic patched in, and that it needs said magic patched in repeatedly since it likes to forget it. are we failing at the step where the client tries to renegotiate?
Another thing that would help answering this is, are these systems that patched or upgraded OpenSSL to take SSL3_FLAGS approach to blocking renegotiation, or the SSL_OP approach? If they're both of the same kind, can we try with one that takes the other approach?
Looks like it doesn't matter. One was compiled with openssl 1.0.0a, the other one with the apple-supplied 0.9.8l (the latter one using SSL3_FLAGS, the former SSL_OP)
03:46 < Sebastian> So here's my first test, running a private network with relays and authorities configured with --enable-bufferevents, and a client that runs without. Both using current master HEAD.03:46 < Sebastian> client debug log available here: http://sebastianhahn.net/tor/clientdebug.log, relay debug here: http://sebastianhahn.net/tor/relaydebug.log03:47 < Sebastian> I used StrictNodes and EntryNodes to make sure this was the node that would be used.03:47 < Sebastian> 127.0.0.1:5001 is where the relay runs.03:50 < Sebastian> It does indeed look like the client fails to renegotiate.03:51 < Sebastian> nickm: trying your suggestion of old openssl now03:52 < Sebastian> nickm: if you look at the relay log, there are weird debug log lines too, though. Looks like the bufferevents-enabled network itself has issues
So on analysis, it looks to me and Sebastian as if the relay gets up to "waiting for renegotiation" then stalls, and the client gets up to "connection_tls_continue_handshake(): wanted read", and that's it. Also, it seems nothing in Sebastian's tests managed to connect to the relays.
I've added a patch in my public repository under branch "loud_ssl_states". It won't fix anything, but it will log (at debug) every SSL state change that occurs in every SSL object, and maybe help us pinpoint this a little better.
So for some reason, now I can reproduce the OpenSSL 1.0.0a case.
Here is what I can tell: renegotiation works and the client successfully calls connection_tls_finish_handshake. It completes, and the client sends a VERSIONS cell. Go client!
The server gets to connection_or_handle_events_cb, and makes it to the case commented as "improved handshake, but not a client". However, for some reason it does NOT call connection_or_tls_renegotiated_cb! I am guessing that the renegotiation has already finished by the time that we reach connection_or_handle_events_cb on the server side!
Compare the client's
Oct 11 12:25:32.045 [debug] connection_tls_finish_handshake(): clienttls handshake (0x1e92d60) with [scrubbed] done. verifying.
with the server's
Oct 11 12:27:33.968 [debug] tor_tls_finish_handshake(): Completed V2TLS handshake with client; waiting for renegotiation.
Okay, there were at least 5 bugs here. :p I think I have them all fixed now, as of fbacbf9f and a9172c87. At least, the relays on my test network now manage to build circuits with the offending openssl versions, which is pretty nice of them. Fixing that, though, turned up some other bugs too.
Note that you will need Libevent 2.0.8-rc, which will be released within a week--I hope! You can also use Libevent git master, 34d64f8a347147aa or later.
Actually, it's possible that server-side tunneled connections might fail, since relays in my private network manage to bootstrap , but the client doesn't.
Also, the client log is full of
Oct 12 14:51:36.034 [debug] connection_or_process_cells_from_inbuf(): 6: starting, inbuf_datalen 512 (0 pending in tls object).Oct 12 14:51:36.034 [debug] connection_or_process_cells_from_inbuf(): 6: starting, inbuf_datalen 0 (0 pending in tls object).Oct 12 14:51:36.166 [debug] run_connection_housekeeping(): Sending keepalive to (127.0.0.1:3001)
Working theory on what's going on with the tunneled connections, it looks like a lot of data is arriving at the client, but for some reason, the client is never getting the EOF.
Some extra debugs I found show that clients get here on consensus downloads...
Oct 12 15:27:50.011 [debug] connection_dir_process_inbuf(): Got data, not eof. Leaving on inbuf. We have 1923 bytes waiting on 0x811dc0
whereas relays only get here:
Oct 12 15:27:49.943 [debug] connection_dir_client_reached_eof(): Got eof on 0x22de440 with 1923 bytes in inbuf
Also worth noting: this only happens on a private network. I can run a bufferevent-enabled client on its own, and it happily downloads directory stuff. So update the working theory to say that the directories aren't sending RELAY_COMMAND_END cells for some reason. This could be an exit bug, but this could also be a linked-connection bug.
Yes! I can confirm that the clients never logs the message "%d: end cell (%s) for stream %d. Removing stream." in this case. So for whatever reason, either no END cell is generated, or it isn't delivered to the client, or the client doesn't notice it. I am guessing that it is the former, based on how this appears in my testing net but not when using a client with the public net.
--
Also, it seems that when we re-create a new bufferevent_openssl object callback, we don't properly set its inbuf/outbuf callbacks to inform us when stuff is successfully flushed, so we never update lastwritten on an OR connection.
Good thing that one didn't make it into production! Fixed it in 5710d99f
Indeed, we aren't generating END cells on linked connections. The only way to get an BEV_EVENT_EOF out of a bufferevent_pair is to call bufferevent_flush on it, and the only way to get connection_edge_reached_eof() called with bufferevents is by generating a BEV_EVENT_EOF.
Arguably, libevent's bufferevent pairs should also generate a BEV_EVENT_EOF when BEV_OPT_CLOSE_ON_FREE is set on them, and you free the other one of them.
I noticed my bufferevents-enabled Tor client going into an infinite loop when used with the public Tor network. I was able to reproduce it once, but now can't reproduce it anymore. Sad :/
Before your changes, but I also stopped being able to reproduce it before I changed the binary at all. A while after the successfully bootstrapped message, Tor started using 100% cpu. No new notice log entries were created, and since I hadn't configured a torrc I couldn't just hup the process to change logging to debug :(
ok, i tested again 8 hours ago on Tor network and i got different Log message after starting Tor a couples times, OrPort now be reachable but after when i try open a circuit, it start to open at infinite different circuits because all failings to be done.
Here are the different log message at notice level only:
oct. 13 01:06:48.346 [Notice] Tor v0.2.3.0-alpha-dev (git-5710d99f). This is experimental software. Do not rely on it for strong anonymity. (Running on Linux x86_64)
oct. 13 01:06:48.346 [Notice] Initialized libevent version 2.0.7-rc-dev using method epoll. Good.
oct. 13 01:06:48.346 [Notice] Opening OR listener on 0.0.0.0:9090
oct. 13 01:06:48.346 [Notice] Opening Directory listener on 0.0.0.0:9091
oct. 13 01:06:48.346 [Notice] Opening Socks listener on 127.0.0.1:9050
oct. 13 01:06:48.347 [Notice] Opening DNS listener on 127.0.0.1:9053
oct. 13 01:06:48.347 [Notice] Opening Control listener on 127.0.0.1:9051
oct. 13 01:06:48.347 [Notice] Based on 775 circuit times, it looks like we don't need to wait so long for circuits to finish. We will now assume a circuit is too slow to use after waiting 19 seconds.
oct. 13 01:06:48.347 [Notice] Parsing GEOIP file.
oct. 13 01:06:50.809 [Notice] Configured to measure statistics. Look for the *-stats files that will first be written to the data directory in 24 hours from now.
oct. 13 01:06:50.809 [Notice] OpenSSL OpenSSL 0.9.8k 25 Mar 2009 [9080bf] looks like it's older than 0.9.8l, but some vendors have backported 0.9.8l's renegotiation code to earlier versions, and some have backported the code from 0.9.8m or 0.9.8n. I'll set both SSL3_FLAGS and SSL_OP just to be safe.
oct. 13 01:06:50.809 [Notice] Your Tor server's identity key fingerprint is 'SwissTorHelp A32E64A8136EBD5124048022786857CB931F584F'
oct. 13 01:06:50.809 [Notice] This version of Tor (0.2.3.0-alpha-dev) is newer than any recommended version, according to the directory authorities. Recommended versions are: 0.2.0.35,0.2.1.19,0.2.1.20,0.2.1.21,0.2.1.22,0.2.1.25,0.2.1.26,0.2.2.1-alpha,0.2.2.2-alpha,0.2.2.3-alpha,0.2.2.4-alpha,0.2.2.5-alpha,0.2.2.6-alpha,0.2.2.7-alpha,0.2.2.8-alpha,0.2.2.10-alpha,0.2.2.11-alpha,0.2.2.12-alpha,0.2.2.13-alpha,0.2.2.14-alpha,0.2.2.15-alpha,0.2.2.16-alpha,0.2.2.17-alpha
oct. 13 01:06:50.810 [Notice] Reloaded microdescriptor cache. Found 12437 descriptors.
oct. 13 01:06:50.810 [Notice] We now have enough directory information to build circuits.
oct. 13 01:06:50.810 [Notice] Bootstrapped 80%: Connecting to the Tor network.
oct. 13 01:06:50.810 [Notice] New control connection opened.
oct. 13 01:06:51.038 [Notice] Guessed our IP address as 80.218.145.226 (source: 208.83.223.34).
oct. 13 01:06:51.890 [Notice] Bootstrapped 85%: Finishing handshake with first hop.
oct. 13 01:06:54.032 [Notice] Bootstrapped 90%: Establishing a Tor circuit.
oct. 13 01:06:54.515 [Notice] Tor has successfully opened a circuit. Looks like client functionality is working.
oct. 13 01:06:54.516 [Notice] Bootstrapped 100%: Done.
oct. 13 01:06:54.516 [Notice] Now checking whether ORPort 80.218.145.226:80 and DirPort 80.218.145.226:443 are reachable... (this may take up to 20 minutes -- look for log messages indicating success)
oct. 13 01:06:54.790 [Warning] Received http status code 404 ("Not found") from server '91.208.34.27:80' while fetching consensus directory.
oct. 13 01:06:56.651 [Notice] Self-testing indicates your DirPort is reachable from the outside. Excellent.
oct. 13 01:06:56.892 [Warning] Received http status code 404 ("Not found") from server '91.208.34.18:80' while fetching consensus directory.
oct. 13 01:06:57.320 [Notice] Based on 777 circuit times, it looks like we don't need to wait so long for circuits to finish. We will now assume a circuit is too slow to use after waiting 18 seconds.
oct. 13 01:07:00.014 [Notice] Self-testing indicates your ORPort is reachable from the outside. Excellent. Publishing server descriptor.
oct. 13 01:07:51.833 [Warning] TLS error while handshaking (with bufferevent) with [scrubbed]: ssl handshake failure (in SSL routines:SSL3_READ_BYTES:SSL3_ST_SR_CERT_A)
oct. 13 01:08:04.322 [Warning] TLS error while handshaking (with bufferevent) with [scrubbed]: ssl handshake failure (in SSL routines:SSL3_READ_BYTES:SSL3_ST_SR_CERT_A)
Other start case: oct. 13 01:14:47.721 [Notice] Self-testing indicates your ORPort is reachable from the outside. Excellent. Publishing server descriptor.
oct. 13 01:14:48.122 [Notice] Bootstrapped 85%: Finishing handshake with first hop.
oct. 13 01:16:17.330 [Warning] Problem bootstrapping. Stuck at 85%: Finishing handshake with first hop. (Connection timed out; TIMEOUT; count 10; recommendation warn)
another start :
oct. 13 01:19:10.751 [Notice] We now have enough directory information to build circuits.
oct. 13 01:19:10.751 [Notice] Bootstrapped 80%: Connecting to the Tor network.
oct. 13 01:19:10.751 [Notice] New control connection opened.
oct. 13 01:19:10.752 [Warning] TLS error while handshaking (with bufferevent) with [scrubbed]: http request (in SSL routines:SSL23_GET_CLIENT_HELLO:SSL23_ST_SR_CLNT_HELLO_A)
oct. 13 01:19:10.752 [Notice] Guessed our IP address as 80.218.145.226 (source: 193.23.244.244).
oct. 13 01:19:13.591 [Notice] Self-testing indicates your ORPort is reachable from the outside. Excellent. Publishing server descriptor.
oct. 13 01:19:18.803 [Notice] Bootstrapped 85%: Finishing handshake with first hop.
next one :
oct. 13 01:20:15.792 [Notice] Bootstrapped 80%: Connecting to the Tor network.
oct. 13 01:20:15.792 [Notice] New control connection opened.
oct. 13 01:20:16.689 [Notice] Bootstrapped 85%: Finishing handshake with first hop.
oct. 13 01:20:16.785 [Notice] Guessed our IP address as 80.218.145.226 (source: 213.115.239.118).
oct. 13 01:20:18.645 [Notice] Bootstrapped 90%: Establishing a Tor circuit.
oct. 13 01:20:19.889 [Notice] Tor has successfully opened a circuit. Looks like client functionality is working.
oct. 13 01:20:19.889 [Notice] Bootstrapped 100%: Done.
oct. 13 01:20:19.889 [Notice] Now checking whether ORPort 80.218.145.226:80 and DirPort 80.218.145.226:443 are reachable... (this may take up to 20 minutes -- look for log messages indicating success)
oct. 13 01:20:20.078 [Notice] Self-testing indicates your ORPort is reachable from the outside. Excellent. Publishing server descriptor.
oct. 13 01:20:21.615 [Notice] Self-testing indicates your DirPort is reachable from the outside. Excellent.
oct. 13 01:20:22.038 [Notice] Based on 794 circuit times, it looks like we don't need to wait so long for circuits to finish. We will now assume a circuit is too slow to use after waiting 20 seconds.
Look good but still never possible to reach te web, when try open a site like Tor Trac or OFTC irc I.E , all circuit fails...and so in infite...
my Torrc :
This file was generated by Tor; if you edit it, comments will not be preserved
The old torrc file was renamed to torrc.orig.1 or similar, and Tor will ignore it
Ug. On the BEV_EVENT_EOF issue, it seems that conn_close_if _marked is trying to doing the flush that should be necessary to make directory servers work. So, there's some bug there. Time for yet more investigation on that.
Now I get reproducible segfaults when using a bufferevent client with the normal network. I believe this is because some pieces of code still try to use connection_t.inbuf with bufferevents. Also the xxx in connection_add_impl() in main.c seem important.
Here's the backtrace:
#0 buf_datalen (buf=0x0) at buffers.c:518#1 0x000000010007fa75 in circuit_resume_edge_reading_helper (first_conn=0x10101aff0, circ=0x10101e820, layer_hint=0x1003e3ba0) at relay.c:1513#2 0x00000001000810db in connection_edge_process_relay_cell (cell=0x7fff5fbff1f0, circ=0x10101e820, conn=0x0, layer_hint=0x1003e3ba0) at relay.c:1232#3 0x0000000100081bea in circuit_receive_relay_cell (cell=0x7fff5fbff1f0, circ=0x10101e820, cell_direction=<value temporarily unavailable, due to optimizations>) at relay.c:223#4 0x0000000100019951 in command_process_relay_cell [inlined] () at /tor-git/tor/src/or/command.c:437#5 0x0000000100019951 in command_process_cell (cell=0x7fff5fbff1f0, conn=0x1010027d0) at command.c:158#6 0x0000000100038d11 in connection_or_process_inbuf (conn=0x1010027d0) at connection_or.c:1453#7 0x000000010002b3ed in connection_handle_read_cb (bufev=<value temporarily unavailable, due to optimizations>, arg=0x1010027d0) at connection.c:2821#8 0x00000001001b8d82 in bufferevent_run_deferred_callbacks_locked (_=0x7fff5fbff1f0, arg=<value temporarily unavailable, due to optimizations>) at bufferevent.c:145#9 0x00000001001b13c7 in event_process_deferred_callbacks [inlined] () at /git/libevent/event.c:1326#10 0x00000001001b13c7 in event_base_loop (base=0x100301000, flags=<value temporarily unavailable, due to optimizations>) at event.c:1365#11 0x000000010006b811 in do_main_loop () at main.c:1725#12 0x000000010006baee in tor_main (argc=1, argv=<value temporarily unavailable, due to optimizations>) at main.c:2402#13 0x00000001000011a4 in start ()
Err, hang on. The stack trace sebastian posted should be better now. There is still probably at least one lingering bug or two that should keep you from running it. It needs more testing on local networks first. :(
There are a couple more bugs to nail down. We no longer fall over when there's a traffic spike, but we need to figure out why there was a traffic spike at all, when maybe there shouldn't have been.
When that's done, for IOCP to be testable, we need to try out the filtering implementation of SSL bufferevents.
The filtering stuff is in. The traffic-spike code is maybe stuff we should debug first. Or we could just go ahead and try it out and hope that any bandwidth-hogging weirdness will get detected soon. Libevent 2.0.9 will be required; that should be out by thanksgiving. Building from libevent git master would also work.
i tested all last git origin/master and now it can build the 4 first circuit but can't build new connections to reach the web. It always failed to connect with any sites, irc ect...
It do a segfault at anytimes too when stop it but don't give any trace about the segfault.
I started this Tor as a client, and it succeeded in building a circuit.
I then configured it as an unpublished bridge using Vidalia, and its reachability test succeeded. A while later, I tried to browse the web through it, and I think it didn't work. (I may be wrong about the 'it didn't work' part; I wasn't trying to figure out or remember what was going on.)
I stopped Tor using Vidalia, and Tor crashed.
I started Tor again, and it was unable to bootstrap.
I configured it as a client only using Vidalia, and then stopped Tor. Tor crashed again.
I started Tor again, and it succeeded in building a circuit.
I encountered a possibly related problem with --enable-bufferevents.
In my environment it seems that bootstrapping never succeeds when buffer events
are used.
The behaviour is similar to what stars described.
Here are the results of a series of tests in which I altered
--enable-bufferevents, the data directory, and TOR_LIBEVENT_TICKS_PER_SECOND
found in src/common/compat_libevent.h.
For the last few I documented the settings and whether bootstrapping succeeded
to post it here:
clean out data directory, restart: stuck at "Bootstrapped 25%: Loading networkstatus consensus."
restart : stuck at "Bootstrapped 25%: Loading networkstatus consensus."
clean out data directory, no buffer events : bootstrapping succeeds
restart : bootstrapping succeeds
enable buffer events, 100 tick/s : stuck at "Bootstrapped 85%: Finishing handshake with first hop."
clean out data directory, restart: stuck at "Bootstrapped 10%: Loading networkstatus consensus."
no buffer events : bootstrapping succeeds
enable buffer events, 100 tick/s : stuck at "Bootstrapped 85%: Finishing handshake with first hop."
enable buffer events, 1 tick/s : stuck at "Bootstrapped 85%: Finishing handshake with first hop."
clear data directory, no buffer events : bootstrapping succeeds
enable buffer events, 1 tick/s : stuck at "Bootstrapped 90%: Establishing a Tor circuit."
Note that it gets stuck earlier if there is no data directory.
In all my testing there was only one situation in which bootstrapping succeeded
with buffer events enabled, which was in 2. above. All previous attempts failed
at some point.
My own observations so far are these:
When starting with an empty data directory, bootstrapping does not go beyond
25%.
When starting with a data directory created by a correctly working tor node
(i.e. one with buffer events disabled) bootstrapping got stuck between
80-90%. One time it even succeeded.
I also tried upgrading to a later version of OpenSSL and libevent, but it did
not have any effect. I still get the same behaviour with buffer events.
tor commit: d120ee1c
OpenSSL version: 0.9.8g (and later 0.9.8n)
libevent version: 2.0.10 stable
Platform: Debian GNU/Linux, running in a VM
uname -a: Linux oniontest 2.6.26-2-xen-686 #1 SMP Thu Jan 27 05:44:37 UTC 2011 i686 GNU/Linux
I come bearing gifts. I have two bundles (Tor expert package and Vidalia bundle) built with the latest libevent (2.0.13-stable) and Tor alpha. Please give them some basic testing and let me know what should be announced for wider consideration (e.g., on the blog, tor-talk, etc.) I am not going to announce anything that is clearly crashy for relays, but I will announce things that are occasionally crashy in new ways.
Okay, we NO LONGER WANT TESTING on the above bundles.
There are some bad bugs in them that we fixed; testing those is now entirely pointless. If you're testing them, please stop. More bundles forthcoming once I can debug more.
If you're building from source on unix or windows[*], please try out the very latest tor master branch with the very latest libevent patches-2.0 branch. I think they're working fairly nicely.
(To test with IOCP on Windows, first test as normal to see whether non-IOCP works. Then, run with "DisableIOCP 0".)
I have tried using it as a client on a few platforms, and as a bridge on windows with IOCP. It seemed to work in all those cases.
I believe that this specific bug is working and can be closed.
Sadly, some people have used this bug as a catch-all bug for "something screwy happened with bufferevents enabled" ; I am not sure that every issue mentioned here is currently solved. We should do a once-over to make sure that every new issue brought up in the comments is either fixed or has a ticket, and then close this.