Investigate Snowflake proxy failures

Trac:
Parent Ticket: #19001 (moved)

added component::circumvention/snowflake owner::cohosh parent::19001 priority::high severity::normal status::needs-information type::defect labels

Trac:
Parent: N/A to #19001 (moved)

Trac:
stun.lua

I wrote the attached stun.lua script to parse pcap files collected from some old snowflake network health measurements from #32545 (moved).

These capture files were generated by trying to bootstrap a Tor connection through snowflake 100 times. Each time the broker will hand the client a different snowflake to connect through. The lua script attempts to figure out the ip address of the snowflake and records whether or not NAT punching succeeded.

For all of the snowflakes that the client fails to connect to, I noticed the following:

the client successfully receives an answer from the broker, meaning ICE candidate gathering succeeded at the snowflake
snowflakes always produce a non-local address. A geolocation of these IP addresses show they aren't necessarily in countries that practice censorship (I checked this after noticing we have stats that show snowflakes in e.g., China). In fact some of the failing snowflakes were in Germany, the US, and the UK.
the client successfully sent a Binding Request to the snowflake, but never receives a Binding Request from the snowflake or a Binding Success Response.

This is a bit suspicious. If it was a firewall issue at the snowflake proxy's end, I would expect their firewall to allow outgoing STUN Binding Request packets to the client, since presumably it already allowed outgoing STUN packets to the STUN server in order to perform the ICE candidate collection. If it was a firewall issue on the client side, I would expect all snowflakes to fail.

After restarting the snowflake network health tests from #32545 (moved), I'm noticing that ~50 out of 400 snowflakes are failing. 12.5% is pretty high.

I'm also finding that the IP addresses of snowflakes that fail always fail, once I go through the output of the attached stun.lua and do some manual inspection to remove false negatives. These false negatives happen because of traffic from old snowflake connections seeping into future packet captures.

We have a lot of unique snowflakes, but we're starting to see repeats of snowflakes by IP after 400 runs, and I have yet to come across a case where a snowflake fails in one use but passes in another. I have seen some snowflakes that have failed twice on different days.

This is encouraging in the sense that if we can come up with a test to check reachability, we can eliminate the problem (perhaps with #32938 (moved)).

My first thought here was to use a slight variation of #32938 (moved) to do a simpler probe that just tests the ability to open a datachannel. This could function similar to the probe test of the bridge added in #31391 (moved).

My hesitancy for relying on this in the same way that we rely on the bridge is that it adds another single point of failure. If for some reason this probe test stops working, we will lose all of our proxies. I suppose the same is true for the broker or the bridge: if any of these stop working, snowflake essentially stops. But this increases the attack surface a bit.

Obviously it's preferable to actually find out what is causing such a high failure rate in proxies and use that to inform what we do. Another thought I has was that if STUN is never successfully completing, the proxies will end up timing out (here and here) because the datachannel was never opened. We could do one of several things if a proxy reaches this state:

log debug information and encourage the owner through the UI to file a Tor ticket with the log messages so we can figure out what's going on,
keep track of how many times this happens, and if it always happens (the proxy sees no successful connections) disable the proxy and print out some debug messages,
do a probe test only when the datachannel fails to open to check whether the proxy can open a datachannel with the probe point.

These aren't necessarily mutually exclusive. Option (2) provides a vector for attack where an adversarial client can make a bunch of connections through proxies and simply never open a datachannel. Hopefully as long as honest client traffic is significantly higher than adversarial traffic, the proxy will see some successes and not trigger the disable condition. In any case, I think encouraging proxy owners to file a ticket if it happens too much is a good way to go here.

Again, these techniques will only help against honest proxies that are trying their best but aren't helping users. I think that's the case for at least most of the proxies here because of the geographic distribution of failed snowflakes.

Here's a patch that implements a variation of option (2) above. If the proxy fails to open a datachannel more than a threshold number of times since the last success, it is disabled with a new missingFeatures message.

https://github.com/cohosh/snowflake/pull/25

Let's get some feedback on this idea before moving further.

I tested this by applying the following patch to the client:

diff --git a/client/lib/rendezvous.go b/client/lib/rendezvous.go
index 1f98e26..330c90a 100644
--- a/client/lib/rendezvous.go
+++ b/client/lib/rendezvous.go
@@ -11,6 +11,7 @@ package lib
 import (
        "bytes"
        "errors"
+       "fmt"
        "io"
        "io/ioutil"
        "log"
@@ -119,8 +120,8 @@ func (bc *BrokerChannel) Negotiate(offer *webrtc.SessionDescription) (
                if nil != err {
                        return nil, err
                }
-               answer := util.DeserializeSessionDescription(string(body))
-               return answer, nil
+               util.DeserializeSessionDescription(string(body))
+               return nil, fmt.Errorf("Dummy error")
        case http.StatusServiceUnavailable:
                return nil, errors.New(BrokerError503)
        case http.StatusBadRequest:

Trac:
Status: new to needs_review

Replying to cohosh:

After restarting the snowflake network health tests from #32545 (moved), I'm noticing that ~50 out of 400 snowflakes are failing. 12.5% is pretty high.

I'll run an experiment locally to try to characterize the nature of failures that I see. I'm pretty sure that for me, the proportion of failing proxies is more than 50%.

Trac:
snowflake-client-proxytest.patch

Patch to have snowflake-client emit a proxytest.csv with details about every connection attempt.

Trac:
proxytest-home.csv.gz

proxytest.csv output from running at home.

Trac:
proxytest-vps.csv.gz

proxytest.csv output from running on a VPS.

Trac:
proxytest.R

Script to do some analysis on proxytest.csv.

I attached some results of testing proxy failures at home and on a Linode VPS. Only 24.79% of ICE answers turned into a working proxy at home, versus 83.62% on the VPS.

Starting at commit 237fed11, apply snowflake-client-proxytest.patch. Run ./client -url https://snowflake-broker.azureedge.net/ -front ajax.aspnetcdn.com -ice stun:stun.l.google.com:19302 -max 1 (no tor needed). Run Rscript proxytest.R proxytest.csv. The outputs are in proxytest-home.csv.gz and proxytest-vps.csv.gz and proxytest.R does some basic analysis. The data are in "long" CSV format with one row per feature, but the script reshapes them into "wide" format with one row per proxy attempt and one column per feature. The id and attempt columns together define one broker interaction and proxy connection attempt. Attempts where the broker returned an answer have is.na(broker_err). Attempts that succeeded in opening a DataChannel have !is.na(ts_open). Locally I have the full offer/answer SDP strings but I didn't get a pcap.

The test falsifies a few hypotheses I had.

Hypothesis: I can only use proxies that have an IPv6 address. No, 9/29 successful attempts did not have an IPv6 address.
Hypothesis: I can only use proxies that send a nonzero address in the c= line. No, 16/29 successful attempts had 0.0.0.0 in the c= line.

Replying to cohosh:

log debug information and encourage the owner through the UI to file a Tor ticket with the log messages so we can figure out what's going on,

keep track of how many times this happens, and if it always happens (the proxy sees no successful connections) disable the proxy and print out some debug messages,

do a probe test only when the datachannel fails to open to check whether the proxy can open a datachannel with the probe point.

My opinion on this is that (2) is a reasonable idea. (I said (3) in the meeting today but I meant (2).)

It does open a new DoS vector: a malicious client can fail all its DataChannels and cause proxies to think they are unreliable.

comment:8 shows that failure rate may be as much a function of the client as of the proxy. Maybe this is a mutally incompatible NAT situation? The symptoms you mention in comment:2 match that. It's possible that both peers are sending binding requests to each other, but neither are making it all the way to the other side.

Replying to dcf:

Replying to cohosh:

log debug information and encourage the owner through the UI to file a Tor ticket with the log messages so we can figure out what's going on,

keep track of how many times this happens, and if it always happens (the proxy sees no successful connections) disable the proxy and print out some debug messages,

do a probe test only when the datachannel fails to open to check whether the proxy can open a datachannel with the probe point.

My opinion on this is that (2) is a reasonable idea. (I said (3) in the meeting today but I meant (2).)

It does open a new DoS vector: a malicious client can fail all its DataChannels and cause proxies to think they are unreliable.

comment:8 shows that failure rate may be as much a function of the client as of the proxy. Maybe this is a mutally incompatible NAT situation? The symptoms you mention in comment:2 match that. It's possible that both peers are sending binding requests to each other, but neither are making it all the way to the other side. Huh. This is a really good find. I was doing my tests on a VPS and my failure rate matches what your VPS failure rate was. I had no idea the NAT topologies of the client and proxy should have anything to do with each other.

Now I'm interested in whether the proxies that fail for a VPS are a subset of the proxies that fail for the home setup. If that's true, then I still think we should move forward with some variation of option (2). If not, then it doesn't seem to be the fault of the proxies and disabling them completely just because they get a lot of home connections might not be the right way to go. Although that is the typical use case. Of course the best thing to do is further track down what's happening here and find a way to make these proxies useful to more clients.

Replying to cohosh:

Now I'm interested in whether the proxies that fail for a VPS are a subset of the proxies that fail for the home setup.

I've sent you an email with the full offer and answer SDP in case you want to dig into the IP addresses.

Replying to dcf:

Replying to cohosh:

Now I'm interested in whether the proxies that fail for a VPS are a subset of the proxies that fail for the home setup.

I've sent you an email with the full offer and answer SDP in case you want to dig into the IP addresses. Thanks! I'll take a look at that today :)

A weird thing I noticed: the offer SDP has two copies of each candidate line, and two a=end-of-candidates?

I tracked this down and it turns out it was fixed recently. There hasn't been a new version tagged with this fix but we should update once there is.

Okay here's a summary of what I've found:

Tl;DR: I think you have a symmetric NAT setup at home, and anyone who does is going to have a lot of difficulty communicating with peers that have more restrictive NATs.

Analyzing the test SDPs

I ran your tests on my own home setup and found a success rate of 78%, which matches the success rate I got from my VPS set up today.
There's nothing different from the candidates as far as I can tell, although I think there still may be some unrelated bugs in the pion-webrtc ICE gathering in addition to the one in comment:13.

Background on NAT topology

There are several different kinds of NATs and each kind has some variance possible due to different implementations:

Full Cone NAT (port forwarding): an internal IP:port is mapped to a fixed external IP:port and any outside party can send a packet to the internal IP:port by knowing the external address
Restricted Cone NAT: same as above, but an outgoing packet from the internal address to the outside party's address must be sent first
Port Restricted Cone NAT: same as above, but an outgoing packet from the internal address to the outside party's IP:port must be sent first
Symmetric NAT: the external IP:port of an outgoing packet depends not only on the internal IP:port but also the destination address.

All but the symmetric NAT should work with STUN. That's because each party will send STUN connection request to the other party's candidates while also waiting to be contacted by the peer. This should satisfy the restricted cone NAT's requirement of outgoing packets to allow the peer's connection requests to punch through.

I came across two different implementations of symmetric NATs:

Random mapping: each internal IP:port and destination address tuple are randomly assigned an external IP:port mapping
Progressive mapping: each IP:port and destination address tuple are assigned an external IP:port mapping that increments for each new tuple.

Something interesting about STUN

I was going through some .pcap files from connection requests and noticed something interesting:

The initial STUN binding request to the STUN server will return a success response with XOR-MAPPED-ADDRESS [myaddr]:54576 and send an offer sdp with candidate [myaddr] 54576 typ srflx raddr 0.0.0.0 rport 54576 generation 0`
I receive the remote peer candidates

a=candidate:foundation 2 udp 2130706431 [peeraddr] 60459 typ host generation 0
a=candidate:foundation 1 udp 1694498815 [peeraddr] 32932 typ srflx raddr 0.0.0.0 rport 32932 generation 0

My client sends STUN binding requests to:

[peeraddr]:60459 and [peeraddr]:32932

I get a binding request from the remote peer at [peeraddr]:60459
I get a several binding success response from the remote peer with the follow XOR-MAPPED-ADDRESS's:

[myaddr]:56373
[myaddr]:54576
[myaddr]:47605

Only [myaddr] 54576 appeared in my offer sdp.

Hypotheses and next steps

I'm guess the above STUN behaviour of successful binding responses with a new XOR-MAPPED-ADDRESS provides a way for symmetric NATs to work some of the time. Specifically, my guess is that if the peer isn't behind a NAT or is behind a full cone or restricted cone NAT, it will accept incoming connections from the symmetric NAT and send the success response with the proper new XOR-MAPPED-ADDRESS and use that port instead of the signalled candidate port.

If the peer is behind a port restricted cone NAT or a symmetric NAT, the binding requests from the symmetric NAT client can't get through. We might want to move on #25595 (moved) to verify this.

The easiest way to solve this issues is to configure a TURN server (ticket #25596 (moved)). I have doubts about how effective this will be from a censorship resistance standpoint, since it produces yet another more centralized set of IP addresses the censor can block.

Another thing we can do is try to restrict our proxies to ones behind less restrictive NATs. Option (2) above can be changed to try to diagnose the NAT topology and this information can be given to the broker. We could also perhaps have clients become aware of their NAT topology and request more or less permissive peers depending on what they need.

Trac:
Status: needs_review to assigned
Owner: N/A to cohosh

Okay I'm reassessing the ideas presented in comment:5 and I think now that we know NAT topologies are likely a large source of the issues here, there are some different options I'd like to consider. The main techniques are:

Option 1: Disable or have less useful proxies poll less often

This is essentially what was discussed above, where we decided that keeping track of how often a datachannel times out without opening is a good metric for figuring out how useful a proxy is, and that disabling it after a few subsequent failed attempts is a good way to go.

To map out the design space here, we can separate this into two parts: how we measure and report the usefulness of a proxy, and what we do with this information.

Measuring a proxy's usefulness

I see three main options here:

A. Have proxies self-report a metric like the number of datachannel timeouts mentioned above.

(+) This is very easy to implement and gives us a good idea of how many clients a proxy works with (-) This is prone to denial of service attacks. A proxy can self-report as good while not functioning properly, or an adversarial client can purposefully fail to open a datachannel causing an honest proxy to believe it isn't useful.

B. Give proxies long-term identifiers and have clients report to the broker the IDs of failed proxies the next time they poll

(+) We've already put a little bit of thought into this. It would require an implementation of #29260 (moved) and a modification of the client-broker protocol which shouldn't be too difficult (+) Here we could restrict the denial of service by an adversarial client based on IP address. A single client IP could be rate limited on reporting bad proxies and could only report on each proxy once. (+) Proxies don't have to be trusted here (-) This adds complexity to the system (-) There are still some denial of service attacks possible if we're not careful. We should take into account client successes as well as failures to ensure that proxies aren't rejoining with different IDs, and make sure honest client successes aren't drowned out by adversarial failure reports.

C. Have an external probe behind different NATs determine how useful a proxy is

This is essentially a modification of #32938 (moved).

(+) Denial of service attacks are harder (-) Still requires honest self-reporting or the implementation of long-term identifiers (#29260 (moved)) (-) Adds a lot more moving parts and single points of failure. What if this probe service goes down? How will we make sure we have a variety of NATs? Who is responsible for it?

What to do with less useful proxies

The drawback to completely disabling a proxy just because it's behind a more restrictive NAT is that we'll be throwing out proxies that could still be useful for other clients and disincentivizing people to participate. It would be frustrating to find that your proxy isn't useful even though you are able to use other WebRTC tools (even though these usually aren't P2P).

However, telling proxies to poll less frequently doesn't actually make them more useful. It just makes other fixes like multiplexing (#25723 (moved)) more likely to have at least one more permissive/robust proxy.

Option 2: Distribute proxies to clients based on their compatibility with each other

I suggested this in comment:14 and while I like it in theory, it's difficult to do in practice, and we'd likely end up relying on heuristics similar to the datachannel timeouts in Option 1. It's possible that we could modify the STUN library to notice which candidates are chosen or what IP:port we're talking to in order to infer over multiple connections what kind of NAT topology we have but I suspect this is more difficulty than it's worth. Datachannel timeouts will likely give us a pretty good idea of what kind of NAT we have.

So, this option would be to take whatever measurement technique is best from Option 1 and also have clients measure their own success rate. These two measurements are then used together when the client polls the broker to get a proxy that's compatible for the client. If a client finds that most of their connections succeed, the broker can give them a proxy that works a lower percentage of the time. If a client typically has difficulty, the broker can give them a more permissive (i.e. higher success rate) proxy.

This requires more complex logic at the broker, an implementation of reliability measurements at the proxy and client, and a change in the protocol between the broker and these pieces. It doesn't seem too difficult though.

Option 3: Configure a TURN server to fall back on (#25596 (moved))

Maybe we want to do this anyway as a short term fix but as mentioned above I have my doubts that this can be a longer term solution.

Personally, I think we should go with Option 1 first and then decide if we want to layer Option 2 on top of it to make less permissive proxies more useful again. I'd also suggest going with option A first since it's the easiest and then seriously consider option B for measuring a proxy's usefulness since I think that will protect us more against denial of service attacks in the long run.

I'd prefer to have the less reliable proxies poll less often at the moment instead of completely disabling them, since that will cause people to get frustrated and drop out of participating even though they still provide some value. That means moving on #25598 (moved).

Okay this implements option 1.B by counting the number of successive failures. It slows the poll rate of the proxy if the failures pass the first threshold (5 in a row), and disables the proxy if it fails 15 times in a row. If the proxy succeeds, the fail count is reset and it goes back to polling at the starting rate.

https://github.com/cohosh/snowflake/pull/25

I'd like feedback on:

the plan to go with option 1.B
the values for the poll rates and thresholds
the method of using a fail count and resetting it. we can also keep track of a fail rate and reset it when we reset the stats.

Trac:
Status: assigned to needs_review

To me, the 1.B you suggested doesn't match the patch at https://github.com/cohosh/snowflake/pull/25. comment:16 makes it sound like 1.B is about clients reporting on proxies to the broker, and the broker enforcing the limit on proxies; but the pull request looks like the proxies noting their own failures and throttling themselves privately, not reporting the failure to anyone.

Nevertheless, I was going to suggest doing something like you've done in the pull request, so from my point of view it looks good.

Something to consider instead of discrete thresholds is a more analog polling frequency. Something like the additive increase/multiplicative decrease of TCP congestion avoidance, say. If a proxy has a failure, it multiplies its polling interval by a fixed percentage; if it has a success, it subtracts from its polling interval a fixed constant (down to some minimum).

Replying to dcf:

To me, the 1.B you suggested doesn't match the patch at https://github.com/cohosh/snowflake/pull/25. comment:16 makes it sound like 1.B is about clients reporting on proxies to the broker, and the broker enforcing the limit on proxies; but the pull request looks like the proxies noting their own failures and throttling themselves privately, not reporting the failure to anyone.

Whoops. You're right. I meant 1.A in both this comment and the one before it >.<

Something to consider instead of discrete thresholds is a more analog polling frequency. Something like the additive increase/multiplicative decrease of TCP congestion avoidance, say. If a proxy has a failure, it multiplies its polling interval by a fixed percentage; if it has a success, it subtracts from its polling interval a fixed constant (down to some minimum).

Ah. I like this better actually. The problem with thresholds is that if a proxy has a restrictive NAT, then once it succeeds it will start polling frequently again right away. With additive or multiplicative increases, we also don't need to worry about disabling proxies alltogether just yet.

I'll work on this and also rip out the disable code for now.

Trac:
Status: needs_review to needs_revision

Alright here's a new PR that uses an additive adjustment to the poll interval: https://github.com/cohosh/snowflake/pull/27

Trac:
Status: needs_revision to needs_review

Replying to cohosh:

Alright here's a new PR that uses an additive adjustment to the poll interval: https://github.com/cohosh/snowflake/pull/27

I left two questions on the PR. Other than that, the code looks good!

Trac:
Status: needs_review to needs_information

Okay I just removed the one unused variable.

I also realized I opened this PR for the wrong repo. Here's a new one: https://github.com/cohosh/snowflake-webext/pull/3

Trac:
Status: needs_information to needs_review

Replying to cohosh:

Okay I just removed the one unused variable.

I also realized I opened this PR for the wrong repo. Here's a new one: https://github.com/cohosh/snowflake-webext/pull/3

LGTM!

Trac:
Status: needs_review to merge_ready

Merged. Now to deploy a new version

Updated the webextension to v 0.3.0 for both Chrome and Firefox

I'll leave this open so we can see if this helps.

Trac:
Status: merge_ready to needs_information

mentioned in issue #33756 (moved)

mentioned in issue #33884 (moved)

mentioned in issue #34129 (moved)

moved to tpo/anti-censorship/pluggable-transports/snowflake#33666 (closed)

mentioned in issue tpo/anti-censorship/pluggable-transports/snowflake#34129 (closed)

Investigate Snowflake proxy failures

Child items ...

Activity

Analyzing the test SDPs

Background on NAT topology

Something interesting about STUN

Hypotheses and next steps

Option 1: Disable or have less useful proxies poll less often

Measuring a proxy's usefulness

What to do with less useful proxies

Option 2: Distribute proxies to clients based on their compatibility with each other

Option 3: Configure a TURN server to fall back on (#25596 (moved))