When starting Tor Browser 11.0.6 on Linux with self-defined bridges, the following message shows up in the log multiple times during bootstrapping:
[WARN] Proxy Client: unable to connect OR connection (handshaking (proxy)) with [omitted] ID=[omitted] RSA_ID=[omitted] (“general SOCKS server failure”)
The bridges work after bootstrapping, but the warnings are irritating nonetheless.
Steps to reproduce:
Set some bridge lines in the Tor Browser config and restart the browser.
What is the current bug behavior?
The Tor logs show this warning multiple times:
[WARN] Proxy Client: unable to connect OR connection (handshaking (proxy)) with [omitted] ID=[omitted] RSA_ID=[omitted] (“general SOCKS server failure”)
What is the expected behavior?
No warnings, as in previous builds.
Environment
Tor Browser 11.0.6 downloaded via auto-update | Linux Debian
Started with ~/tor-browser_en-US/Browser/start-tor-browser
Relevant logs and/or screenshots
Multiple times:
[WARN] Proxy Client: unable to connect OR connection (handshaking (proxy)) with [omitted] ID=[omitted] RSA_ID=[omitted] (“general SOCKS server failure”)
For smallerRichard, there have been problems in the past with it being offline (tpo/anti-censorship/team#44 (closed)). It looks like it was online for the latter half of January but is offline again.
The bridges used are not the default bridges; they were set manually. They work after bootstrapping, so there is no real explanation for the warning, not even in the debug logs.
I can reproduce the "general SOCKS server failure" messages with a default bridge. You get a few such log messages, then the connection eventually works.
Feb 11 10:29:03.000 [notice] new bridge descriptor 'deusexmachina' (cached): $5B403DFE34F4872EB027059CECAE30B0C864B3A2~deusexmachina at 185.100.87.30
Feb 11 10:29:03.000 [notice] Delaying directory fetches: Pluggable transport proxies still configuring
Feb 11 10:29:04.000 [notice] Bootstrapped 5%: Connecting to directory server
Feb 11 10:29:04.000 [notice] Bootstrapped 10%: Finishing handshake with directory server
Feb 11 10:29:04.000 [notice] Bootstrapped 45%: Asking for relay descriptors
Feb 11 10:29:05.000 [warn] Proxy Client: unable to connect to 185.100.87.30:443 ("general SOCKS server failure")
Feb 11 10:29:05.000 [notice] Delaying directory fetches: No running bridges
Feb 11 10:29:05.000 [warn] Proxy Client: unable to connect to 185.100.87.30:443 ("general SOCKS server failure")
Feb 11 10:29:14.000 [notice] Bootstrapped 55%: Loading relay descriptors
Feb 11 10:29:18.000 [notice] Bootstrapped 61%: Loading relay descriptors
Feb 11 10:29:30.000 [notice] Bootstrapped 69%: Loading relay descriptors
Feb 11 10:29:31.000 [notice] Bootstrapped 74%: Loading relay descriptors
Feb 11 10:29:31.000 [notice] Bootstrapped 80%: Connecting to the Tor network
Feb 11 10:29:31.000 [notice] Bootstrapped 90%: Establishing a Tor circuit
Feb 11 10:29:32.000 [notice] Bootstrapped 100%: Done
The obfs4proxy log shows that the underlying cause is an authentication failure at the obfs4 layer.
I suppose it is related to the Elligator2 changes in obfs4proxy-0.0.12 (tor-browser-build#40416 (closed)), specifically commit 393aca86cc. I get the log messages with that commit but not with its parent.
torrc:
DataDirectory datadir.client
SOCKSPort 9250
UseBridges 1
# Logs are in datadir.client/pt_state/obfs4proxy.log
ClientTransportPlugin obfs4 exec ./obfs4proxy -enableLogging -unsafeLogging -logLevel DEBUG
Bridge obfs4 185.100.87.30:443 5B403DFE34F4872EB027059CECAE30B0C864B3A2 cert=bWUdFUe8io9U6JkSLoGAvSAUDcB779/shovCYmYAQb/pW/iEAMZtO/lCd94OokOF909TPA iat-mode=2
We see the same problem in the Tails automated test suite, which made us hold back on the upgrade to obfs4proxy 0.0.12 so far. Initially we thought it could be a problem with our Chutney setup, so we improved our test suite machinery to extract more relevant information. We can now confirm we see the same ntor AUTH mismatch error, but tor eventually bootstraps (it can take a while though). We can reproduce this both with Chutney and on the real Tor network.
This particularly affects Tails because:
If Tor has not reached bootstrap-phase=BOOTSTRAP_STATUS_CONN_DONE after 10s, we assume it is blocked or otherwise won't manage to connect, and we retry with obfs4 (if we were previously connecting directly) or offer troubleshooting options to the user. This has allowed us to make our UX much better.
This issue makes bootstrap slower with obfs4 bridges. In order to allow bootstrapping to complete, we would have to bump our 10s timeout significantly, which essentially reverts to our previous, significantly worse, UX: in a variety of common situations, we would keep trying for a while something that cannot possibly work before offering other options to the user.
@intrigeri: note that as the default obfs4 bridges and other bridges upgrade, the situation is at some point going to flip for Tails, where if you don't have the new obfs4 client, you will begin experiencing these compatibility issues more and more frequently.
> @intrigeri: note that as the default obfs4 bridges and other bridges upgrade, the situation is at some point going to flip for Tails, where if you don't have the new obfs4 client, you will begin experiencing these compatibility issues more and more frequently.
Thanks a lot! It was not clear to me (or I forgot) that "new obfs4proxy server-side + old obfs4proxy client-side" would cause compatibility issues, so this is a very useful reminder to me!
This issue is actually likely to be fairly complicated and it will require some investigation to fix properly. The team is working on it but it will take some time.
In helping a user on #tor just now, I realized: this bug combines poorly with tpo/core/tor#40396 (closed) -- because if one of your first two obfs4 bridges has a reachability compatibility issue and doesn't connect, you'll mark it down, and then wedge yourself in a state where you won't try the other one.
So, the fact that stable tor browser still doesn't have a fix for 40396 makes this new bug more unfortunate. Or said in a more optimistic way, people who are being bitten by this bug might want to switch to Tor Browser alpha.
Right, I mean bug 40804 (this ticket we're on here) is worse in the stable, because the stable has another bug (40396) which makes this one worse. That other bug isn't present in the alpha.
Ok, I built the new obfs4proxy on the client side, and connected to an old obfs4 bridge, and experienced this bug myself.
Fortunately, Tor has a built-in retry mechanism when it wants to get a fresh bridge descriptor: when the connection to the bridge fails, it tries again very soon after:
Feb 23 08:30:08.552 [info] entry_guards_note_guard_failure(): Recorded failure for primary confirmed guard $F80C186BBDDDDC1DF6ECD8D654025590B187BF4E ($F80C186BBDDDDC1DF6ECD8D654025590B187BF4E)
Feb 23 08:30:08.552 [info] connection_dir_client_request_failed(): Giving up on serverdesc/extrainfo fetch from directory server at 185.177.207.145:8443; retrying
[...]
Feb 23 08:30:09.313 [debug] download_status_log_helper(): 185.177.207.145 attempted 2 time(s); I'll try again in 1 seconds.
[...]
Feb 23 08:30:09.317 [info] circuit_handle_first_hop(): Next router is $F80C186BBDDDDC1DF6ECD8D654025590B187BF4E~bridge18 [IP77RXiPiseVvHsp6nfPmFaJ7+eIU+aMJRuceXAI5o8] at 185.177.207.145: Not connected. Connecting.
So: I think there will still be edge cases where this bug probabilistically bites us -- for example, if we manage to succeed at a connection at first, to fetch the descriptor, but then later we get disconnected and try to reconnect and we happen to fail, then we will mark that bridge down until we decide to retry it. But I think even in this case, Tor's failsafe logic of "if you think all of your entry points are down, and you get a new socks request in, then mark them up and give it another go" should let us re-attempt the connection.
In summary: are there situations where this bug actually puts us in a broken situation (as opposed to a situation where we simply have scary warning messages in our logs, or where one of Tor's failsafes needs to kick in to rescue us)?
And if yes we can end up in a broken situation, my next question will be whether that's actually tpo/core/tor#40396 (closed) and we've fixed it already in the alpha.
> Fortunately, Tor has a built-in retry mechanism when it wants to get a fresh bridge descriptor: when the connection to the bridge fails, it tries again very soon after.
When I tried it with one bridge, the difficulty I encountered was that tor would wait longer and longer between attempts. (Perhaps exponentially increasing, driven by download_status_schedule_get_delay?) One or two failures is fine, but if you fail the dice roll a few times, it starts to take a long time. If I recall, after around 10 failures, tor was waiting more than 60 seconds between attempts. The distribution of total waiting time looks something like a geometric distribution multiplied by an exponential function. (The number of attempts it takes to get a working connection × the amount of time you've had to wait to make that many attempts.)
The increasing delays are probably not a problem with the default bridges, since there are many of them, and the probability that they all fail their first or second attempt is low. People who use one private bridge are more likely to encounter long delays.
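For illustration, here is a small Python sketch of that "geometric distribution multiplied by an exponential function" intuition. The backoff parameters are made up for the sketch; tor's actual schedule (download_status_schedule_get_delay) differs:

```python
import random

def total_wait(p_fail=0.75, base=1.0, factor=2.0, rng=random):
    """Total seconds spent waiting before an attempt succeeds, assuming
    (hypothetically) each attempt fails independently with probability
    p_fail and the delay grows geometrically after each failure."""
    delay, total = base, 0.0
    while rng.random() < p_fail:   # this attempt failed
        total += delay             # wait before retrying
        delay *= factor
    return total

# The number of attempts is geometric, and the waiting time per attempt
# grows exponentially, so a run of bad luck gets expensive quickly:
random.seed(7)
samples = sorted(total_wait() for _ in range(1000))
print(samples[500], samples[990])  # median vs. 99th percentile
```

With doubling delays, k consecutive failures already cost 2^k − 1 seconds of waiting, which matches the observation that a handful of failed dice rolls pushes the retry interval past a minute.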
Roger Dingledine changed title from "Proxy Client: unable to connect OR connection | warings when starting with bridges" to "Tor Browser's new obfs4proxy client has compatibility issues with old obfs4proxy bridges".
So, is it safe/wise to use bridges now with all of the errors generated when doing so, or should we switch to the regular way to connect and not use bridges until this has been fixed?
I guess safety is a relative question. This bug was introduced by a security fix, so we believe the obfs4 reconnection behavior is no worse for your security than the previous status quo, and using bridges should be safe for most people. It just takes longer to connect.
By using Tor the regular way, I guess you mean that you don't need bridges to avoid Tor blockades but use them for other reasons. If so, I'm not sure what to say about safety. Bridges are not designed as a security mechanism but as an anti-censorship tool. Bridges don't add much safety on top of normal Tor; they might make it harder to see that Tor is being used, but you lose the protection given by guard nodes. Anyway, this is unrelated to the problem we are discussing here and would be better as a conversation elsewhere, like the forum.
As we are asking the default bridges to update, and I'm about to release a docker image with the newest version of obfs4, I tested using the old client to connect to a bridge running the new version. I see the same problem: it fails with the same warning and tries to reconnect until it succeeds.
I have a good understanding now of the cause of this interoperability problem.
In short, obfs4proxy-0.0.12 changed the interpretation of bit 254 (counting from 0) of Elligator-encoded public key representatives (the first 32 bytes sent in either direction, called X′ and Y′ in the spec).
Before version 0.0.12, obfs4proxy interpreted bit 254 as part of the public key representative.
After version 0.0.12, obfs4proxy ignores bit 254, as if it were set to 0.
There will be an authentication failure between an old obfs4proxy and a new obfs4proxy whenever either side has bit 254 of its public key representative set to 1: the peers have different ideas of what public keys they are using. The connection succeeds only when both sides have bit 254 equal to 0. Because the bits are 0 or 1 independently with probability 1/2, a connection failure occurs with probability 3/4 per attempt.
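A quick way to sanity-check that 3/4 figure is to enumerate the four possible combinations of bit 254 on the two sides (a trivial Python sketch, not obfs4proxy code):

```python
# Enumerate bit 254 of the client's and the server's representative.
# Across an old/new version boundary, the handshake succeeds only when
# both bits are 0, because only then do the two versions decode the
# same public keys.
from itertools import product

def cross_version_failure(client_bit, server_bit):
    return client_bit == 1 or server_bit == 1

failures = sum(cross_version_failure(c, s) for c, s in product((0, 1), repeat=2))
print(failures / 4)  # 0.75
```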
obfs4proxy-0.0.12 had a good reason to change the interpretation of public key representatives.
The final step of Elligator encoding (the inverse map) is to take a square root in the finite field. However, there are two square roots to choose from, a positive one and a negative one. You are supposed to always take the positive square root, which, because it is positive, has its most significant bits equal to 0. Then you are supposed to separately randomize the most significant bits, which include bit 254.
The agl/ed25519 package that was formerly used by obfs4proxy for Elligator encoding/decoding had a bug: it did not always take the positive square root, but used the positive or negative root consistently for any given input. This created a correlation between bit 254 and the lower-order bits, which made public key representatives distinguishable from random.
Background on this noncanonical square root issue:
Define √a as |b| if b² = a, and as |b·√−1| otherwise. Here |b| means b if b ∈ {0, 1, …, (q − 1)/2}, and −b otherwise.
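As an illustration of that definition, here is a toy Python sketch of the canonical ("positive") square root in the Curve25519 field GF(q), q = 2^255 − 19, using the standard exponentiation trick for primes q ≡ 5 (mod 8). This is my own illustration, not obfs4proxy's code, and a real implementation would be constant-time:

```python
# Canonical square root in GF(q), q = 2**255 - 19 (toy illustration).
q = 2**255 - 19
SQRT_M1 = pow(2, (q - 1) // 4, q)  # a square root of -1, valid since q ≡ 5 (mod 8)

def canonical_sqrt(a):
    """Return |b| where b² ≡ a (mod q), or |b·√-1| if a is a non-square."""
    a %= q
    b = pow(a, (q + 3) // 8, q)    # candidate root: b² ≡ ±a (mod q)
    if (b * b) % q != a:
        b = (b * SQRT_M1) % q      # fix up: now b² ≡ a if a was a square
    # |b|: pick the representative in {0, 1, ..., (q-1)//2}.
    # Since (q-1)//2 < 2**254, this value always has bits 254 and 255 = 0,
    # which is why those bits must be randomized separately.
    return b if b <= (q - 1) // 2 else q - b

print(canonical_sqrt(4))  # 2
```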
The cross-version failure rate can be reduced from 3/4 to 1/2, but I do not think it can be completely eliminated.
In the case where the client's public key representative has bit 254 equal to 0 and the server's representative has bit 254 equal to 1, the client could internally try both possible interpretations of the server's public key (i.e., the old 255-bit interpretation and the new 254-bit interpretation) and let the ntor authentication check pass if it works under either interpretation.
But when the client's public key representative has bit 254 equal to 1, there's nothing we can do. The client doesn't know the private key x that corresponds to the server's interpretation of the public key, so it cannot do the EXP(Y, x) and EXP(B, x) parts of the ntor computation.
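To see concretely why a disagreement about one side's public key sinks the handshake, here is a toy Diffie-Hellman sketch. A small multiplicative group stands in for Curve25519, and a flipped bit stands in for the bit-254 misinterpretation; none of these numbers come from obfs4:

```python
# Toy discrete-log Diffie-Hellman; NOT real Curve25519/ntor arithmetic.
p = 2**61 - 1          # a Mersenne prime, chosen purely for illustration
g = 7

server_priv = 123456789
server_pub = pow(g, server_priv, p)

client_priv = 2027
client_pub = pow(g, client_priv, p)

# The client decodes the server's representative differently than the
# server intended (standing in for the bit-254 mixup), so it sees a
# different public key:
misread_server_pub = server_pub ^ (1 << 40)

shared_on_server = pow(client_pub, server_priv, p)      # g^(c*s)
shared_on_client = pow(misread_server_pub, client_priv, p)
print(shared_on_server == shared_on_client)  # False: the two sides derive
                                             # different keys, so the
                                             # authentication check fails
```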
A consequence of the above is that you can cause the version compatibility error to occur with probability 1, as a client, by always sending a representative that has bit 254 set to 1.
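As a sketch of what "a representative with bit 254 set to 1" means at the byte level, assuming the usual 32-byte little-endian encoding of field elements (an assumption on my part, matching common curve25519 conventions, not copied from obfs4proxy):

```python
def bit254(representative: bytes) -> int:
    """Bit 254 of a 32-byte little-endian representative:
    bit 6 of the last byte (bit 7 of that byte would be bit 255)."""
    assert len(representative) == 32
    return (representative[31] >> 6) & 1

def force_bit254(representative: bytes) -> bytes:
    """Return a copy with bit 254 set to 1. A hostile client could do
    this to every representative it sends, making each cross-version
    handshake fail deterministically."""
    out = bytearray(representative)
    out[31] |= 0x40
    return bytes(out)

r = bytes(32)                              # all-zero representative
print(bit254(r), bit254(force_bit254(r)))  # 0 1
```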
See tpo/anti-censorship/team#91 (comment 2832610) for a script that leverages this fact to detect pre-0.0.12 servers.