Client's NAT checking is slow, so restricted proxies are almost unused
Clients assume NATUnknown until the NAT check has been performed, but they send the rendezvous request before the NAT check is done. The broker pairs NATUnknown clients with NATUnrestricted proxies. So the client almost always gets paired with an unrestricted proxy initially, regardless of its actual NAT type, and can only switch to a restricted proxy on later retries (e.g. if there were no proxies on the first request, or if it failed to connect to the initial proxy).
This could result in unrestricted proxies getting overloaded with unrestricted clients, and restricted clients not being able to connect to Snowflake.
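To make the pairing problem concrete, here is a rough Go sketch of the rule described above. It is not the broker's actual code: `proxyPoolFor` and the string values are hypothetical, and sending unrestricted clients to restricted proxies is my assumption about the intended behavior.

```go
package main

import "fmt"

// NAT type labels as named in this issue; the string values are assumptions.
const (
	NATUnknown      = "unknown"
	NATRestricted   = "restricted"
	NATUnrestricted = "unrestricted"
)

// proxyPoolFor is a hypothetical helper that models the pairing rule:
// only a client that already knows it is unrestricted is sent to the
// restricted proxy pool. Unknown and restricted clients both get an
// unrestricted proxy.
func proxyPoolFor(clientNAT string) string {
	if clientNAT == NATUnrestricted {
		return NATRestricted
	}
	return NATUnrestricted
}

func main() {
	// The first rendezvous goes out before the NAT check finishes, so the
	// client reports NATUnknown even if its real NAT type is unrestricted.
	fmt.Println("first request (NAT check pending):", proxyPoolFor(NATUnknown))
	fmt.Println("later request (NAT check done):  ", proxyPoolFor(NATUnrestricted))
}
```

With this rule, the restricted pool is only ever used by clients whose NAT check finished before they sent a request, which is the situation the log below illustrates.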
Here is a client log sample:
--- Starting Snowflake Client ---
Using ICE servers:
url: stun:stun.l.google.com:19302
url: stun:stun.sonetel.com:3478
url: stun:stun.antisip.com:3478
url: stun:stun.epygi.com:3478
url: stun:stun.voys.nl:3478
Rendezvous using Broker at: https://snowflake-broker.torproject.net/
---- SnowflakeConn: begin collecting snowflakes ---
---- SnowflakeConn: starting a new session ---
WebRTC: Collecting a new Snowflake. Currently at [0/1]
snowflake-... connecting...
redialing on same connection
---- SnowflakeConn: begin stream 3 ---
WebRTC: DataChannel created
WebRTC: Created offer
WebRTC: Set local description
Warning: NAT checking failed for server at stun.l.google.com:19302: NAT discovery feature not supported: attribute not found
Negotiating via HTTP rendezvous...
Target URL: snowflake-broker.torproject.net
HTTP rendezvous response: 200 OK
Received answer: {"answer":"{\"type\":\"answer\",\"sdp\":\"v=0...}"}
Received Answer.
WebRTC: DataChannel.OnOpen
---- Handler: snowflake assigned ----
Traffic Bytes (in|out): 25 | 148 -- (1 OnMessages, 6 Sends)
WebRTC: At capacity [1/1] Retrying...
Warning: NAT checking failed for server at stun.sonetel.com:3478: Error retrieveing server response: timed out waiting for response
Traffic Bytes (in|out): 58 | 58 -- (2 OnMessages, 2 Sends)
WebRTC: At capacity [1/1] Retrying...
Warning: NAT checking failed for server at stun.antisip.com:3478: Error retrieveing server response: timed out waiting for response
NAT Type: unrestricted
Traffic Bytes (in|out): 58 | 58 -- (2 OnMessages, 2 Sends)
You can see that the client was assigned a snowflake before the NAT check was complete, and that the check itself took a long time because of the STUN timeouts.
This is anecdotally confirmed by users who report ~200 connections per hour for unrestricted proxies, while restricted ones usually get fewer than 10 per hour. Of course, you could argue that this is just because there are far more clients with restricted NAT than with unrestricted NAT, but perhaps the difference is still a bit too big.
Unfortunately, the broker metrics do not show us the counts of client NAT types. But here is a chart: #40178.
As a start, I propose that we prioritize the STUN servers that actually support NAT checking. Perhaps we could provide this information through CLI / SOCKS options.
We do not do that currently: we simply shuffle all the provided STUN servers and take a slice of them (see the code). That shuffle was introduced in !7 (merged). Not many of the default STUN servers support NAT checking.
In addition, NAT checking can fail completely if none of the STUN servers in the chosen subset supports it.
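A minimal sketch of what such prioritization could look like, in Go. This is an illustration and not the client's current code: `pickSTUNServers`, the `supportsNATCheck` map (which would have to come from some new CLI / SOCKS option), and the `example.*` server names are all hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickSTUNServers returns up to n servers. Servers marked in
// supportsNATCheck are shuffled among themselves and placed first; the
// remaining servers are shuffled and appended after them. This keeps the
// load-spreading effect of the shuffle while making sure NAT-check-capable
// servers are not dropped from the slice when any are available.
func pickSTUNServers(all []string, supportsNATCheck map[string]bool, n int) []string {
	var capable, rest []string
	for _, s := range all {
		if supportsNATCheck[s] {
			capable = append(capable, s)
		} else {
			rest = append(rest, s)
		}
	}
	rand.Shuffle(len(capable), func(i, j int) { capable[i], capable[j] = capable[j], capable[i] })
	rand.Shuffle(len(rest), func(i, j int) { rest[i], rest[j] = rest[j], rest[i] })

	ordered := append(capable, rest...)
	if n < 0 {
		n = 0
	}
	if n > len(ordered) {
		n = len(ordered)
	}
	return ordered[:n]
}

func main() {
	// Placeholder server names; which real servers support NAT discovery
	// is not asserted here.
	servers := []string{
		"stun.example.com:3478",
		"stun.example.net:3478",
		"stun.example.org:3478",
	}
	capable := map[string]bool{"stun.example.org:3478": true}
	fmt.Println(pickSTUNServers(servers, capable, 2))
}
```

The NAT check could then try the servers in the returned order, so a capable server is tried before any that would only fail or time out.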
Just FYI, here is a list of STUN servers that support NAT checking, and here is another one.
Another improvement would be to make the client attempt to connect to restricted proxies initially, at least once I'd say. If the connection succeeds, great. If not, it will connect to an unrestricted proxy, or wait until the NAT check is complete.
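Roughly, the fallback could look like the following Go sketch. Everything here is hypothetical (`connectOptimistically`, `dialFunc`, `errDialFailed`, the pool names); it only shows the order of attempts, not how the client actually talks to the broker, and it leaves out the "wait until the NAT check is complete" branch.

```go
package main

import (
	"errors"
	"fmt"
)

// errDialFailed is a placeholder error for a failed connection attempt.
var errDialFailed = errors.New("dial failed")

// dialFunc stands in for "ask the broker for a proxy from this pool and
// try to open a connection to it".
type dialFunc func(proxyPool string) error

// connectOptimistically tries a restricted proxy first and falls back to
// an unrestricted one only if that attempt fails. It returns the pool
// that worked.
func connectOptimistically(dial dialFunc) (string, error) {
	for _, pool := range []string{"restricted", "unrestricted"} {
		if err := dial(pool); err == nil {
			return pool, nil
		}
	}
	return "", errors.New("no proxy reachable in either pool")
}

func main() {
	// Simulate a client behind a restricted NAT: restricted proxies fail,
	// unrestricted proxies work.
	restrictedClient := func(pool string) error {
		if pool == "restricted" {
			return errDialFailed
		}
		return nil
	}
	pool, err := connectOptimistically(restrictedClient)
	fmt.Println(pool, err)
}
```

The cost is one extra failed attempt for restricted clients; the benefit is that unrestricted clients stop occupying unrestricted proxies on their first connection.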
UPD: this last idea is a duplicate of #40178
Related #40304