When starting Tor Browser 11.0.6 on Linux with self-defined bridges, the following message shows up in the log multiple times during bootstrapping:
[WARN] Proxy Client: unable to connect OR connection (handshaking (proxy)) with [omitted] ID=[omitted] RSA_ID=[omitted] (“general SOCKS server failure”)
The bridges work after bootstrapping, but the warnings are irritating nonetheless.
Steps to reproduce:
Set some bridge lines in the Tor Browser config and restart the browser.
What is the current bug behavior?
The Tor logs show this warning multiple times:
[WARN] Proxy Client: unable to connect OR connection (handshaking (proxy)) with [omitted] ID=[omitted] RSA_ID=[omitted] (“general SOCKS server failure”)
What is the expected behavior?
No warnings, as in previous builds.
Environment
Tor Browser 11.0.6 downloaded via auto-update | Linux Debian
Started with ~/tor-browser_en-US/Browser/start-tor-browser
Relevant logs and/or screenshots
Multiple times:
[WARN] Proxy Client: unable to connect OR connection (handshaking (proxy)) with [omitted] ID=[omitted] RSA_ID=[omitted] (“general SOCKS server failure”)
For smallerRichard, there have been problems in the past with it being offline (tpo/anti-censorship/team#44 (closed)). It looks like it was online for the latter half of January but is offline again.
The bridges used are not the default bridges; they were set manually. They work after bootstrapping, so there is no real explanation for the warning, not even in the debug logs.
I can reproduce the "general SOCKS server failure" messages with a default bridge. You get a few such log messages, then the connection eventually works.
Feb 11 10:29:03.000 [notice] new bridge descriptor 'deusexmachina' (cached): $5B403DFE34F4872EB027059CECAE30B0C864B3A2~deusexmachina at 185.100.87.30
Feb 11 10:29:03.000 [notice] Delaying directory fetches: Pluggable transport proxies still configuring
Feb 11 10:29:04.000 [notice] Bootstrapped 5%: Connecting to directory server
Feb 11 10:29:04.000 [notice] Bootstrapped 10%: Finishing handshake with directory server
Feb 11 10:29:04.000 [notice] Bootstrapped 45%: Asking for relay descriptors
Feb 11 10:29:05.000 [warn] Proxy Client: unable to connect to 185.100.87.30:443 ("general SOCKS server failure")
Feb 11 10:29:05.000 [notice] Delaying directory fetches: No running bridges
Feb 11 10:29:05.000 [warn] Proxy Client: unable to connect to 185.100.87.30:443 ("general SOCKS server failure")
Feb 11 10:29:14.000 [notice] Bootstrapped 55%: Loading relay descriptors
Feb 11 10:29:18.000 [notice] Bootstrapped 61%: Loading relay descriptors
Feb 11 10:29:30.000 [notice] Bootstrapped 69%: Loading relay descriptors
Feb 11 10:29:31.000 [notice] Bootstrapped 74%: Loading relay descriptors
Feb 11 10:29:31.000 [notice] Bootstrapped 80%: Connecting to the Tor network
Feb 11 10:29:31.000 [notice] Bootstrapped 90%: Establishing a Tor circuit
Feb 11 10:29:32.000 [notice] Bootstrapped 100%: Done
The obfs4proxy log shows that the underlying cause is an authentication failure at the obfs4 layer.
I suppose it is related to the Elligator2 changes in obfs4proxy-0.0.12 (tor-browser-build#40416 (closed)), specifically commit 393aca86cc. I get the log messages with that commit but not with its parent.
torrc:
DataDirectory datadir.client
SOCKSPort 9250
UseBridges 1
# Logs are in datadir.client/pt_state/obfs4proxy.log
ClientTransportPlugin obfs4 exec ./obfs4proxy -enableLogging -unsafeLogging -logLevel DEBUG
Bridge obfs4 185.100.87.30:443 5B403DFE34F4872EB027059CECAE30B0C864B3A2 cert=bWUdFUe8io9U6JkSLoGAvSAUDcB779/shovCYmYAQb/pW/iEAMZtO/lCd94OokOF909TPA iat-mode=2
We see the same problem in the Tails automated test suite, which made us hold back on the upgrade to obfs4proxy 0.0.12 so far. Initially we thought it could be a problem with our Chutney setup, so we improved our test suite machinery to extract more relevant information. We can now confirm we see the same ntor AUTH mismatch error, but tor eventually bootstraps (it can take a while though). We can reproduce this both with Chutney and on the real Tor network.
This particularly affects Tails because:
If Tor has not reached bootstrap-phase=BOOTSTRAP_STATUS_CONN_DONE after 10s, we assume it is blocked or otherwise won't manage to connect, and we retry with obfs4 (if we were previously connecting directly) or offer troubleshooting options to the user. This has allowed us to make our UX much better.
This issue makes bootstrap slower with obfs4 bridges. In order to allow bootstrapping to complete, we would have to bump our 10s timeout significantly, which essentially reverts to our previous, significantly worse, UX: in a variety of common situations, we would keep trying for a while something that cannot possibly work before offering other options to the user.
@intrigeri: note that as the default obfs4 bridges and other bridges upgrade, the situation is at some point going to flip for Tails, where if you don't have the new obfs4 client, you will begin experiencing these compatibility issues more and more frequently.
> @intrigeri: note that as the default obfs4 bridges and other bridges upgrade, the situation is at some point going to flip for Tails, where if you don't have the new obfs4 client, you will begin experiencing these compatibility issues more and more frequently.
Thanks a lot! It was not clear to me (or I forgot) that "new obfs4proxy server-side + old obfs4proxy client-side" would cause compatibility issues, so this is a very useful reminder to me!
This issue is actually likely to be fairly complicated and it will require some investigation to fix properly. The team is working on it but it will take some time.
In helping a user on #tor just now, I realized: this bug combines poorly with tpo/core/tor#40396 (closed) -- because if one of your first two obfs4 bridges has a reachability compatibility issue and doesn't connect, you'll mark it down, and then wedge yourself in a state where you won't try the other one.
So, the fact that stable tor browser still doesn't have a fix for 40396 makes this new bug more unfortunate. Or said in a more optimistic way, people who are being bitten by this bug might want to switch to Tor Browser alpha.
Right, I mean bug 40804 (this ticket we're on here) is worse in the stable, because the stable has another bug (40396) which makes this one worse. That other bug isn't present in the alpha.
Ok, I built the new obfs4proxy on the client side, and connected to an old obfs4 bridge, and experienced this bug myself.
Fortunately, Tor has a built-in retry mechanism when it wants to get a fresh bridge descriptor: when the connection to the bridge fails, it tries again very soon after:
Feb 23 08:30:08.552 [info] entry_guards_note_guard_failure(): Recorded failure for primary confirmed guard $F80C186BBDDDDC1DF6ECD8D654025590B187BF4E ($F80C186BBDDDDC1DF6ECD8D654025590B187BF4E)
Feb 23 08:30:08.552 [info] connection_dir_client_request_failed(): Giving up on serverdesc/extrainfo fetch from directory server at 185.177.207.145:8443; retrying
[...]
Feb 23 08:30:09.313 [debug] download_status_log_helper(): 185.177.207.145 attempted 2 time(s); I'll try again in 1 seconds.
[...]
Feb 23 08:30:09.317 [info] circuit_handle_first_hop(): Next router is $F80C186BBDDDDC1DF6ECD8D654025590B187BF4E~bridge18 [IP77RXiPiseVvHsp6nfPmFaJ7+eIU+aMJRuceXAI5o8] at 185.177.207.145: Not connected. Connecting.
So: I think there will still be edge cases where this bug probabilistically bites us -- for example, if we manage to succeed at a connection at first, to fetch the descriptor, but then later we get disconnected and try to reconnect and we happen to fail, then we will mark that bridge down until we decide to retry it. But I think even in this case, Tor's failsafe logic of "if you think all of your entry points are down, and you get a new socks request in, then mark them up and give it another go" should let us re-attempt the connection.
In summary: are there situations where this bug actually puts us in a broken situation (as opposed to a situation where we simply have scary warning messages in our logs, or where one of Tor's failsafes needs to kick in to rescue us)?
And if yes we can end up in a broken situation, my next question will be whether that's actually tpo/core/tor#40396 (closed) and we've fixed it already in the alpha.
> Fortunately, Tor has a built-in retry mechanism when it wants to get a fresh bridge descriptor: when the connection to the bridge fails, it tries again very soon after.
When I tried it with one bridge, the difficulty I encountered was that tor would wait longer and longer between attempts. (Perhaps exponentially increasing, driven by download_status_schedule_get_delay?) One or two failures is fine, but if you fail the dice roll a few times, it starts to take a long time. If I recall, after around 10 failures, tor was waiting more than 60 seconds between attempts. The distribution of total waiting time looks something like a geometric distribution multiplied by an exponential function. (The number of attempts it takes to get a working connection × the amount of time you've had to wait to make that many attempts.)
The increasing delays are probably not a problem with the default bridges, since there are many of them, and the probability that they all fail their first or second attempt is low. People who use one private bridge are more likely to encounter long delays.
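For illustration, here is a small Python sketch of that "geometric distribution multiplied by an exponential function" intuition. The backoff parameters are made up for the sketch; tor's actual schedule (download_status_schedule_get_delay) differs:

```python
import random

def total_wait(p_fail=0.75, base=1.0, factor=2.0, rng=random):
    """Total seconds spent waiting before an attempt succeeds, assuming
    (hypothetically) each attempt fails independently with probability
    p_fail and the delay grows geometrically after each failure."""
    delay, total = base, 0.0
    while rng.random() < p_fail:   # this attempt failed
        total += delay             # wait before retrying
        delay *= factor
    return total

# The number of attempts is geometric, and the waiting time per attempt
# grows exponentially, so a run of bad luck gets expensive quickly:
random.seed(7)
samples = sorted(total_wait() for _ in range(1000))
print(samples[500], samples[990])  # median vs. 99th percentile
```

With doubling delays, k consecutive failures already cost 2^k − 1 seconds of waiting, which matches the observation that a handful of failed dice rolls pushes the retry interval past a minute.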
Roger Dingledine changed title from "Proxy Client: unable to connect OR connection | warings when starting with bridges" to "Tor Browser's new obfs4proxy client has compatibility issues with old obfs4proxy bridges".
So, is it safe/wise to use bridges now with all of the errors generated when doing so, or should we switch to the regular way to connect and not use bridges until this has been fixed?
I guess safety is a relative question. This bug was introduced by a security fix, so we believe the obfs4 reconnection behavior is no worse for your security than the previous status quo, and using bridges should be safe for most people. It just takes longer to connect.
By using Tor the regular way, I guess you mean that you don't need bridges to avoid Tor blockades but use them for other reasons. If so, I'm not sure what to say about safety. Bridges are not designed as a security mechanism but as an anti-censorship tool. Bridges don't add much safety on top of normal Tor; they might make it harder to see that Tor is being used, but you lose the protection given by guard nodes. Anyway, this is unrelated to the problem we are discussing here and would be better as a conversation elsewhere, like the forum.
As we are asking the default bridges to update, and I'm about to release a docker image with the newest version of obfs4, I tested using the old client to connect to a bridge running the new version. I see the same problem: it fails with the same warning and tries to reconnect until it succeeds.
I have a good understanding now of the cause of this interoperability problem.
In short, obfs4proxy-0.0.12 changed the interpretation of bit 254 (counting from 0) of Elligator-encoded public key representatives (the first 32 bytes sent in either direction, called X′ and Y′ in the spec).
Before version 0.0.12, obfs4proxy interpreted bit 254 as part of the public key representative.
After version 0.0.12, obfs4proxy ignores bit 254, as if it were set to 0.
There will be an authentication failure between an old obfs4proxy and a new obfs4proxy whenever either side has bit 254 of its public key representative set to 1: the peers have different ideas of what public keys they are using. The connection succeeds only when both sides have bit 254 equal to 0. Because the bits are 0 or 1 independently with probability 1/2, a connection failure occurs with probability 3/4 per attempt.
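A quick way to sanity-check that 3/4 figure is to enumerate the four possible combinations of bit 254 on the two sides (a trivial Python sketch, not obfs4proxy code):

```python
# Enumerate bit 254 of the client's and the server's representative.
# Across an old/new version boundary, the handshake succeeds only when
# both bits are 0, because only then do the two versions decode the
# same public keys.
from itertools import product

def cross_version_failure(client_bit, server_bit):
    return client_bit == 1 or server_bit == 1

failures = sum(cross_version_failure(c, s) for c, s in product((0, 1), repeat=2))
print(failures / 4)  # 0.75
```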
obfs4proxy-0.0.12 had a good reason to change the interpretation of public key representatives.
The final step of Elligator encoding (the inverse map) is to take a square root in the finite field. However, there are two square roots to choose from, a positive one and a negative one. You are supposed to always take the positive square root, which, because it is positive, has its most significant bits equal to 0. Then you are supposed to separately randomize the most significant bits, which include bit 254.
The agl/ed25519 package that was formerly used by obfs4proxy for Elligator encoding/decoding had a bug: it did not always take the positive square root, but used the positive or negative root consistently for any given input. This created a correlation between bit 254 and the lower-order bits, which made public key representatives distinguishable from random.
Background on this noncanonical square root issue:
Define √a as |b| if b² = a, and as |b·√−1| otherwise. Here |b| means b if b ∈ {0, 1, …, (q − 1)/2}, and −b otherwise.
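As an illustration of that definition, here is a toy Python sketch of the canonical ("positive") square root in the Curve25519 field GF(q), q = 2^255 − 19, using the standard exponentiation trick for primes q ≡ 5 (mod 8). This is my own illustration, not obfs4proxy's code, and a real implementation would be constant-time:

```python
# Canonical square root in GF(q), q = 2**255 - 19 (toy illustration).
q = 2**255 - 19
SQRT_M1 = pow(2, (q - 1) // 4, q)  # a square root of -1, valid since q ≡ 5 (mod 8)

def canonical_sqrt(a):
    """Return |b| where b² ≡ a (mod q), or |b·√-1| if a is a non-square."""
    a %= q
    b = pow(a, (q + 3) // 8, q)    # candidate root: b² ≡ ±a (mod q)
    if (b * b) % q != a:
        b = (b * SQRT_M1) % q      # fix up: now b² ≡ a if a was a square
    # |b|: pick the representative in {0, 1, ..., (q-1)//2}.
    # Since (q-1)//2 < 2**254, this value always has bits 254 and 255 = 0,
    # which is why those bits must be randomized separately.
    return b if b <= (q - 1) // 2 else q - b

print(canonical_sqrt(4))  # 2
```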
The cross-version failure rate can be reduced from 3/4 to 1/2, but I do not think it can be completely eliminated.
In the case where the client's public key representative has bit 254 equal to 0 and the server's representative has bit 254 equal to 1, the client could internally try both possible interpretations of the server's public key (i.e., the old 255-bit interpretation and the new 254-bit interpretation) and let the ntor authentication check pass if it works under either interpretation.
But when the client's public key representative has bit 254 equal to 1, there's nothing we can do. The client doesn't know the private key x that corresponds to the server's interpretation of the public key, so it cannot do the EXP(Y, x) and EXP(B, x) parts of the ntor computation.
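To see concretely why a disagreement about one side's public key sinks the handshake, here is a toy Diffie-Hellman sketch. A small multiplicative group stands in for Curve25519, and a flipped bit stands in for the bit-254 misinterpretation; none of these numbers come from obfs4:

```python
# Toy discrete-log Diffie-Hellman; NOT real Curve25519/ntor arithmetic.
p = 2**61 - 1          # a Mersenne prime, chosen purely for illustration
g = 7

server_priv = 123456789
server_pub = pow(g, server_priv, p)

client_priv = 2027
client_pub = pow(g, client_priv, p)

# The client decodes the server's representative differently than the
# server intended (standing in for the bit-254 mixup), so it sees a
# different public key:
misread_server_pub = server_pub ^ (1 << 40)

shared_on_server = pow(client_pub, server_priv, p)      # g^(c*s)
shared_on_client = pow(misread_server_pub, client_priv, p)
print(shared_on_server == shared_on_client)  # False: the two sides derive
                                             # different keys, so the
                                             # authentication check fails
```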
A consequence of the above is that you can cause the version compatibility error to occur with probability 1, as a client, by always sending a representative that has bit 254 set to 1.
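As a sketch of what "a representative with bit 254 set to 1" means at the byte level, assuming the usual 32-byte little-endian encoding of field elements (an assumption on my part, matching common curve25519 conventions, not copied from obfs4proxy):

```python
def bit254(representative: bytes) -> int:
    """Bit 254 of a 32-byte little-endian representative:
    bit 6 of the last byte (bit 7 of that byte would be bit 255)."""
    assert len(representative) == 32
    return (representative[31] >> 6) & 1

def force_bit254(representative: bytes) -> bytes:
    """Return a copy with bit 254 set to 1. A hostile client could do
    this to every representative it sends, making each cross-version
    handshake fail deterministically."""
    out = bytearray(representative)
    out[31] |= 0x40
    return bytes(out)

r = bytes(32)                              # all-zero representative
print(bit254(r), bit254(force_bit254(r)))  # 0 1
```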
See tpo/anti-censorship/team#91 (comment 2832610) for a script that leverages this fact to detect pre-0.0.12 servers.