I think I found the cause. The snowflake-server process was running out of file descriptors.
I am not sure why that should have caused such a drastic reduction in throughput, but the log messages around the time of the drop are clear:
2022/10/04 17:13:59 http: Accept error: accept tcp [scrubbed]: accept4: too many open files; retrying in 5ms
2022/10/04 17:13:59 http: Accept error: accept tcp [scrubbed]: accept4: too many open files; retrying in 5ms
2022/10/04 17:13:59 http: Accept error: accept tcp [scrubbed]: accept4: too many open files; retrying in 10ms
2022/10/04 17:13:59 handleConn: failed to connect to ORPort: dial tcp [scrubbed]->[scrubbed]: socket: too many open files
2022/10/04 17:13:59 http: Accept error: accept tcp [scrubbed]: accept4: too many open files; retrying in 5ms
2022/10/04 17:13:59 http: Accept error: accept tcp [scrubbed]: accept4: too many open files; retrying in 5ms
2022/10/04 17:13:59 http: Accept error: accept tcp [scrubbed]: accept4: too many open files; retrying in 10ms
2022/10/04 17:13:59 handleConn: failed to connect to ORPort: error reading TOR_PT_AUTH_COOKIE_FILE "/etc/extor-static-cookie/static_extended_orport_auth_cookie": open /etc/extor-static-cookie/static_extended_orport_auth_cookie: too many open files
2022/10/04 17:13:59 http: Accept error: accept tcp [scrubbed]: accept4: too many open files; retrying in 5ms
2022/10/04 17:13:59 http: Accept error: accept tcp [scrubbed]: accept4: too many open files; retrying in 5ms
2022/10/04 17:13:59 http: Accept error: accept tcp [scrubbed]: accept4: too many open files; retrying in 10ms
2022/10/04 17:13:59 handleConn: failed to connect to ORPort: error reading TOR_PT_AUTH_COOKIE_FILE "/etc/extor-static-cookie/static_extended_orport_auth_cookie": open /etc/extor-static-cookie/static_extended_orport_auth_cookie: too many open files
2022/10/04 17:13:59 http: Accept error: accept tcp [scrubbed]: accept4: too many open files; retrying in 5ms
2022/10/04 17:13:59 http: Accept error: accept tcp [scrubbed]: accept4: too many open files; retrying in 5ms
2022/10/04 17:13:59 handleConn: failed to connect to ORPort: error reading TOR_PT_AUTH_COOKIE_FILE "/etc/extor-static-cookie/static_extended_orport_auth_cookie": open /etc/extor-static-cookie/static_extended_orport_auth_cookie: too many open files
2022/10/04 17:13:59 handleConn: failed to connect to ORPort: error reading TOR_PT_AUTH_COOKIE_FILE "/etc/extor-static-cookie/static_extended_orport_auth_cookie": open /etc/extor-static-cookie/static_extended_orport_auth_cookie: too many open files
2022/10/04 17:14:01 http: TLS handshake error from [scrubbed]: EOF
2022/10/04 17:14:03 http: TLS handshake error from [scrubbed]: read tcp [scrubbed]->[scrubbed]: read: connection reset by peer
2022/10/04 17:14:04 http: TLS handshake error from [scrubbed]: read tcp [scrubbed]->[scrubbed]: i/o timeout
2022/10/04 17:14:07 reading token: websocket: close 1006 (abnormal closure): unexpected EOF
2022/10/04 17:14:13 error copying ORPort to WebSocket EOF
2022/10/04 17:14:13 reading token: websocket: close 1001 (going away)
2022/10/04 17:14:15 reading token: websocket: close 1006 (abnormal closure): unexpected EOF
2022/10/04 17:14:16 http: TLS handshake error from [scrubbed]: EOF
2022/10/04 17:14:16 reading token: websocket: close 1006 (abnormal closure): unexpected EOF
2022/10/04 17:14:18 http: TLS handshake error from [scrubbed]: read tcp [scrubbed]->[scrubbed]: i/o timeout
2022/10/04 17:14:20 reading token: read tcp [scrubbed]->[scrubbed]: read: connection reset by peer
2022/10/04 17:14:23 no address in clientID-to-IP map (capacity 98304)
2022/10/04 17:14:24 http: TLS handshake error from [scrubbed]: read tcp [scrubbed]->[scrubbed]: i/o timeout
2022/10/04 17:14:25 reading token: websocket: close 1006 (abnormal closure): unexpected EOF
2022/10/04 17:14:25 http: TLS handshake error from [scrubbed]: read tcp [scrubbed]->[scrubbed]: i/o timeout
2022/10/04 17:14:28 http: TLS handshake error from [scrubbed]: read tcp [scrubbed]->[scrubbed]: read: connection reset by peer
2022/10/04 17:14:30 http: TLS handshake error from [scrubbed]: EOF
2022/10/04 17:14:31 reading token: websocket: close 1001 (going away)
2022/10/04 17:14:31 no address in clientID-to-IP map (capacity 98304)
2022/10/04 17:14:32 reading token: websocket: close 1001 (going away)
2022/10/04 17:14:35 reading token: websocket: close 1006 (abnormal closure): unexpected EOF
2022/10/04 17:14:35 reading token: websocket: close 1006 (abnormal closure): unexpected EOF
2022/10/04 17:14:35 reading token: websocket: close 1006 (abnormal closure): unexpected EOF
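For reference, a quick way to gauge how close the process is to its descriptor limit (the use of pidof and the process name here are my assumptions, not taken from the logs above):

ls /proc/$(pidof snowflake-server)/fd | wc -l
grep 'Max open files' /proc/$(pidof snowflake-server)/limits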
I edited /etc/systemd/system/snowflake-server.service to increase LimitNOFILE from 131072 to 1310720 (a factor of 10) and restarted the service at 2022-10-05 01:10:13. Now I see:
# cat /proc/49677/limits
Limit                     Soft Limit           Hard Limit           Units
...
Max open files            1048576              1048576              files
...
I guess the requested value gets capped at 1048576 (2^20), presumably by the kernel's default fs.nr_open limit.
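For completeness, this is the shape of the change and reload described above (the unit path is from this comment; the rest of the unit file contents are assumed):

# /etc/systemd/system/snowflake-server.service (excerpt)
[Service]
LimitNOFILE=1310720

# apply the change and restart
systemctl daemon-reload
systemctl restart snowflake-server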
It looks like it's working again, though bandwidth is slow to recover, so I will watch it for a while longer before closing the issue.
As of 10:20, traffic has still not recovered beyond 1.08 Gbps.
You're right. There must be something else going on. And it may be beyond what we have control of on the host.
I ran two speedtest downloads, one from Gothenburg, Sweden (ping 10 ms), and one from Dallas, USA (ping 136 ms). The one from Dallas was about 5% as fast (0.129 Gbps) as the one from Gothenburg (2.5 Gbps).
# curl --output /dev/null http://speedtest.tele2.net/10GB.zip
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10.0G  100 10.0G    0     0   308M      0  0:00:33  0:00:33 --:--:--  318M

# curl --output /dev/null http://speedtest.dallas.linode.com/garbage.php?ckSize=1000
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1000M    0 1000M    0     0  16.1M      0  --:--:--  0:01:01 --:--:--  21.2M
The speed to/from Dallas is the same in both directions, as I tested with some netcat transfers from a server located there.
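For anyone repeating the test, the netcat transfers can be done along these lines (host, port, and transfer size here are hypothetical, and the exact nc flags vary by netcat variant):

# on the Dallas server: discard whatever arrives on port 12345
nc -l -p 12345 > /dev/null
# on the bridge: send 1 GiB and time it to estimate throughput in that direction
dd if=/dev/zero bs=1M count=1024 | nc dallas.example.com 12345
# then swap sender and receiver to measure the opposite direction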
I compared the two most recent sets of stats files published by the bridge; the boundary between them conveniently falls within 15 minutes of the dropoff:
2022-10-04 17:01
2022-10-05 17:01
Below are the top 20 countries according to dirreq-v3-reqs on 2022-10-04, and how much they changed 24 hours later. If there's a pattern, it's that countries in or near Europe increased (de, gb, by, fr, nl), while other countries remained about the same (ru, cn, in) or decreased (ir, us, mu, tn).
country  2022-10-04 17:01:14  2022-10-05 17:01:14  increase/decrease
ir                     52408                15680               −70%
??                     12232                 8208                −33%
us                     10560                 4480                −58%
ru                      5232                 5056                 −3%
cn                       656                  600                 −9%
mu                       648                  384                −41%
tn                       472                  200                −58%
de                       408                  624                +53%
ma                       256                  112                −56%
gb                       208                  272                +31%
eg                       200                  168                −16%
za                       192                  128                −33%
ng                       184                  112                −39%
by                       168                  200                +19%
zm                       160                   48                −70%
fr                       152                  272                +79%
nl                       136                  168                +24%
au                       120                   88                −27%
sd                       120                   80                −33%
in                       112                  104                 −7%
There's a similar pattern in bridge-ips. Some countries in or near Europe increased (de, fr), but also, by the measure of bridge-ips, some countries in Europe did not increase very much (gb, by). Other countries remained about the same (ru, cn) or decreased (ir, us, mu, tn, eg).
I am just now reflecting that these country-specific metrics are not really meaningful if we are considering the problem of the bandwidth drop from a routing perspective. That's because the countries are those of the end users, not of the Snowflake proxies, which are what connect directly to the bridge. However the metrics do show that the cause is not a country-wide block of broker rendezvous.
Offhand, I can't think of why lowered bandwidth at the bridge should cause reduced proxy polls. It's possible the causation is the other way: fewer proxy polls cause less bandwidth to be used. But then what could cause a sudden drop in proxy polls?
The most obvious explanation is the broker deployment of #40193 (closed), but the timing doesn't match up: 2022-10-03 12:50 versus 2022-10-04 17:15. Version 0.6.2 of the WebExtension was released at about the right time (the commit is from 2022-10-04 16:09), but is it plausible it would cause such a massive drop all at once? I would not expect all browser users to upgrade simultaneously, nor the Firefox and Chrome stores to make the updates available at the same time.
Oh I hadn't thought about that. I would expect a difficulty with proxies to cause client polls to spike, not suddenly drop. But this is maybe too complex to intuit.
But then what could cause a sudden drop in proxy polls?
My mistake, the graph shows client polls, not proxy polls. Massive and immediate blocking of broker rendezvous channels could have that effect. The obvious suspect is Iran, but #40207 (comment 2840696) shows before-and-after large drops in other countries as well.
@cohosh what are the units on the vertical axis of the client requests graphs? If it says 75K, is that per second or per hour?
@shelikhoo I am seeing a lot more "rejected relay pattern from proxy" messages in the broker log. In fact, they seem to occur more frequently than SYNs to port 443, which should be impossible? Try this to see how fast they are coming in:
tail -F /var/log/snowflake-broker/current | head -n 1000
I measured the time to see 10000 "rejected relay pattern" logs and compared it to the time to see 10000 SYN packets. I measure about 345 "rejected relay pattern" messages per second (10000 / 28.973 s), but only about 67 SYNs per second (10000 / 148.874 s).
I may be doing something wrong with the measurement, but can you help explain it?
# time tail -F /var/log/snowflake-broker/current | grep rejected | head -n 10000 > /dev/null

real	0m28.973s
user	0m0.243s
sys	0m0.778s
# time tcpdump -n 'dst port 443 and ((ip and tcp[tcpflags] & tcp-syn != 0) or (ip6 and (ip6[40+13] & tcp-syn) != 0))' | head -n 10000 > /dev/null

real	2m28.874s
user	0m0.001s
sys	0m0.124s
The graph is generated by the following prometheus query:
sum by(nat) (increase(snowflake_rounded_client_poll_total[5m]))
where snowflake_rounded_client_poll_total is a monotonically increasing counter.
sum by(nat) just means that we've grouped all client polls by NAT type.
According to the prometheus query documentation, increase() shows by how much the counter has increased over the last 5 minutes. This value is given every second, to show a smooth curve.
If you want, I can show the results of a different query that might be more useful to interpret directly.
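For example, an essentially equivalent per-second view would be the following (my sketch, using the same metric name; increase() over a 5-minute window is rate() scaled by 300 seconds):

sum by(nat) (rate(snowflake_rounded_client_poll_total[5m]))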
I am seeing a lot more "rejected relay pattern from proxy" messages in the broker log. In fact, they seem to occur more frequently than SYNs to port 443, which should be impossible?
Rejected proxies re-poll more frequently than proxies that are waiting for clients. We saw this effect in the prometheus metrics for the broker when support for these proxies was turned off last Monday:
My intuition would be that re-polling proxies are re-using their TCP sessions.
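To illustrate that intuition (a schematic sketch, not the actual proxy code): Go's default http.Transport keeps idle connections alive, so a proxy polling in a loop can send many HTTP requests, and therefore trigger many "rejected relay pattern" log lines, over a single TCP connection with a single SYN.

package main

import (
	"bytes"
	"io"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{} // the default Transport enables HTTP keep-alive
	for {
		// hypothetical broker URL and empty body, for illustration only
		resp, err := client.Post("https://broker.example/proxy", "application/json",
			bytes.NewReader([]byte("{}")))
		if err == nil {
			io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
			resp.Body.Close()
		}
		time.Sleep(2 * time.Second) // a rejected proxy re-polls quickly
	}
}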
I updated the graph in the description to show the dynamics over the past 2 days of diminished bandwidth. It's not flat or capped at 1 Gbps, but still has roughly the expected diurnal pattern. The few little spikes are bandwidth tests being run from the bridge.
Rebooted the bridge at 2022-10-06 19:48:48. It was operational again by 2022-10-06 19:51:18 when I was able to log in with ssh. I'm not seeing any increase in bandwidth, but it was a long shot anyway.
However the metrics do show that the cause is not a country-wide block of broker rendezvous.
I have a new working hypothesis: the sudden decline is caused by a partial block of the broker in Iran. As for why countries other than Iran are seemingly affected, my best guess is that they are geolocation errors: IP addresses in Iran wrongly being attributed to other countries (chiefly us and ??).
Below are the top 20 countries according to dirreq-v3-reqs on 2022-10-04, and how much they changed 24 hours later.
You can see the overall massive decline between 04 Oct and 05 Oct.
The change from 52408 to 15680 matches what I posted in the first table at #40207 (comment 2840696).
Unlike typical Tor Metrics graphs, this visualization does not sum the separate tor instances into a single point, and it does not linearly distribute the counts over 24-hour periods, instead showing the exact timestamp dirreq-stats-end associated with dirreq-v3-reqs in the descriptor. The apparent decline in ru at the left is illusory, caused by increasing the number of tor instances (#40173 (closed), #40176 (closed); more instances ⇒ less traffic per descriptor). What I want to call out is: see how us and ?? increase/decrease along with fluctuations in ir. It is almost as if those series are scaled-down versions of the ir series.
If we assume that the only significant change at the time was in ir, then we can say the value of us (the number of users that geolocate to the US, correctly or incorrectly) is

us = P(geo(us)|¬ir)×N(¬ir) + P(geo(us)|ir)×N(ir)

where P(X|Y) is the probability that geolocation guesses X when the true country is in Y, and N(X) is the unknown true number of users from countries in X. When N(ir) decreases, the left-hand term stays the same, but the right-hand term decreases, and so does the apparent us count; I think this is what is happening in the graph.
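As a purely illustrative example (every number below is invented to show the mechanism; none of these are measurements):

% Suppose P(geo(us)|\neg ir) = 0.5 with N(\neg ir) = 20000 held constant,
% P(geo(us)|ir) = 0.08, and N(ir) falls from 500000 to 150000.
\[
\begin{aligned}
\text{before: } \mathit{us} &= 0.5 \times 20000 + 0.08 \times 500000 = 50000 \\
\text{after: } \mathit{us} &= 0.5 \times 20000 + 0.08 \times 150000 = 22000
\end{aligned}
\]
% The apparent us count falls by more than half even though the true number of
% US users never changed.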
If this working hypothesis is correct, the question becomes: what is the nature of the block? To this question we do not yet have a clear answer.
One guess is that it has to do with the Golang TLS fingerprint in the currently shipped version of snowflake-client. That arose out of this comment, which said Tor Browser for desktop worked and Orbot did not work, on the same network.
Some differences I see between TB and Orbot handshakes: they use the same cipher-suites, but not in the same order. The TLS supported_versions extension lists only 1.2 and 1.3 for TB, but also 1.1 and 1.0 for Orbot.
So I consider TLS fingerprinting to remain a possibility.
tpo/applications/tor-browser-build#40629 (closed) will upgrade snowflake-client to a version that supports utls in a future version of Tor Browser. I don't know about the release plan for Orbot.
Another report showed that all DNS lookups failed while trying to use Snowflake on a mobile network. I wasn't able to get more information about whether it was only Snowflake-related domains that were affected (or if all DNS was blocked for the user, or if it was a misconfiguration).
TorBrowser on Linux and TorBrowser on Android share the same cipher-suite order; however, Orbot has a different order (but the same cipher-suite set). This translates to a different JA3 hash (commonly used for TLS fingerprinting in IDS): 3fed133de60c35724739b913924b6c24 for the TBs, and 049f44ae40ab2cab555bdfee22e7d7cb for Orbot.
TorBrowser on Linux and Orbot both claim to support TLS 1.0-1.3, whereas TorBrowser on Android supports only TLS 1.2-1.3. This does not change the JA3 fingerprint, as it is sensitive to the list and order of extensions, but not to their size or content.
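As a sketch of what goes into a JA3 hash (per the public JA3 specification; the field values in main below are hypothetical): the five ClientHello fields are rendered as dash-joined decimal lists, comma-separated, and MD5-hashed, so extension contents such as the supported_versions list do not affect the hash.

package main

import (
	"crypto/md5"
	"fmt"
	"strconv"
	"strings"
)

// join renders a list of 16-bit values as the dash-separated decimal form JA3 uses.
func join(vals []uint16) string {
	parts := make([]string, len(vals))
	for i, v := range vals {
		parts[i] = strconv.Itoa(int(v))
	}
	return strings.Join(parts, "-")
}

// ja3 hashes the five ClientHello fields: version, ciphers, extension types,
// elliptic curves, and point formats. Extension contents are not included.
func ja3(version uint16, ciphers, extensions, curves, pointFormats []uint16) string {
	input := fmt.Sprintf("%d,%s,%s,%s,%s", version,
		join(ciphers), join(extensions), join(curves), join(pointFormats))
	return fmt.Sprintf("%x", md5.Sum([]byte(input)))
}

func main() {
	// hypothetical field values, for illustration only
	fmt.Println(ja3(771,
		[]uint16{4865, 4866, 4867},
		[]uint16{0, 23, 65281, 43},
		[]uint16{29, 23, 24},
		[]uint16{0}))
}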
Your tb-android.pcap has fingerprint 14062e58336049c2 and your tb-linux.pcap has fingerprint 99c75b317dc7a5a1. The two are very similar, as you said, the only difference being the Supported Versions field.
I am guessing, based on the fact that your Android fingerprint prioritized AES-GCM ciphersuites, that the hardware of whatever device you tested it on satisfies the cpu.ARM64.HasAES && cpu.ARM64.HasPMULL test. My guess as to why I saw the fingerprint adfe55afa6f23950, which de-prioritizes AES-GCM, in the amd64 desktop version of Tor Browser, is that I was running it in a QEMU VM, which may not have signaled the needed CPU features. (I just checked, and indeed, inside the VM, cat /proc/cpuinfo does not have the flags aes and pclmulqdq.)
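For anyone who wants to check a device directly, here is a small diagnostic sketch (my own, not part of snowflake) using golang.org/x/sys/cpu, which exposes the same kind of feature flags that crypto/tls consults when deciding whether to prioritize AES-GCM:

package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/cpu"
)

func main() {
	switch runtime.GOARCH {
	case "arm64":
		// the condition crypto/tls checks before prioritizing AES-GCM on arm64
		fmt.Println("hardware AES-GCM:", cpu.ARM64.HasAES && cpu.ARM64.HasPMULL)
	case "amd64":
		// the real check also requires SSE4.1 and SSSE3; this is the core of it
		fmt.Println("hardware AES-GCM:", cpu.X86.HasAES && cpu.X86.HasPCLMULQDQ)
	default:
		fmt.Println("no check sketched for", runtime.GOARCH)
	}
}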
The upshot of all this is that we can likely get snowflake-client to work again in Iran by shipping utls support. The critical platform for this to happen is Android, because the native Go crypto/tls fingerprint on amd64 happens not to be blocked.
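As a rough sketch of what shipping utls support means in practice (the endpoint and the choice of HelloFirefox_Auto below are illustrative; the real integration would wire this into wherever snowflake-client makes its TLS connections):

package main

import (
	"fmt"
	"net"

	utls "github.com/refraction-networking/utls"
)

// dialUTLS makes a TCP connection and performs a TLS handshake whose ClientHello
// imitates Firefox, rather than presenting the stock Go crypto/tls fingerprint.
func dialUTLS(addr, sni string) (*utls.UConn, error) {
	tcp, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	conn := utls.UClient(tcp, &utls.Config{ServerName: sni}, utls.HelloFirefox_Auto)
	if err := conn.Handshake(); err != nil {
		tcp.Close()
		return nil, err
	}
	return conn, nil
}

func main() {
	// hypothetical endpoint; a real deployment would dial its actual TLS server
	conn, err := dialUTLS("192.0.2.10:443", "front.example")
	if err != nil {
		fmt.Println("handshake failed:", err)
		return
	}
	conn.Close()
	fmt.Println("handshake OK")
}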
It should be noted that AES acceleration is more likely to be found on high-end smartphones, while more affordable phones that use the 32-bit version of ARM usually don't have AES acceleration support.
TorBrowser on Linux and Orbot both claim to support TLS 1.0-1.3, whereas TorBrowser on Android supports only TLS 1.2-1.3.
This is another aspect that apparently is important, which is investigated at tpo/anti-censorship/team#96 (comment 2845607). The Go crypto/tls fingerprint changed in go1.18 just as you describe: it raised the default minimum supported version in Client Hellos from TLS 1.0 to TLS 1.2. It appears that only the min = TLS 1.0 (go1.17) fingerprint is on the current Iran blocklist.
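Expressed as explicit settings (a sketch only; the real difference comes from the compiler's crypto/tls defaults, not from anything in snowflake's code):

package main

import "crypto/tls"

// A client built with go1.17 behaves as if MinVersion were VersionTLS10, so its
// ClientHello offers TLS 1.0 through 1.3; go1.18 raised the client-side default
// to VersionTLS12, which changes the supported_versions extension and therefore
// the observable fingerprint.
func go117LikeConfig() *tls.Config { return &tls.Config{MinVersion: tls.VersionTLS10} }
func go118LikeConfig() *tls.Config { return &tls.Config{MinVersion: tls.VersionTLS12} }

func main() {
	_ = go117LikeConfig()
	_ = go118LikeConfig()
}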
Tor Browser for Android + Snowflake [11.5.4 without any bridge line changes] works, but Orbot (iOS and Android) with snowflake is not working.
Tor Browser 11.5.4 for Android uses go1.18.7, while Tor Browser 11.5.4 for desktop uses go1.17.13. (That's right, the Tor Browser build uses different versions of the Go compiler depending on the target.) Since neither of these two configurations is blocked, we surmise that both conditions have to be true in order to get blocked: you need to be using the non-accelerated AES ciphersuite list (more common on mobile), and you need to have a snowflake-client compiled by a pre-go1.18 version of the compiler (minimum TLS version = 1.0).
Of course, it is probably easy for the censors to add the go1.18 fingerprint to their blocklist, so proper camouflage in the form of uTLS is the right answer. But recompiling client programs with go1.18 could buy a few days' time.
I think the analysis is finished. In summary, we believe that the sudden drop in users and bandwidth on 2022-10-04 was caused by blocking of a specific TLS fingerprint in Iran. The TLS fingerprint was the one used by Snowflake in Orbot on most types of mobile devices (and likely other Go-based tools). Closely related fingerprints (different because of accidents of hardware AES support or compiler version) used by Tor Browser on desktop and Android were not blocked. The fact that certain other country codes (notably us and ??) experienced a simultaneous decrease is attributed to geolocation errors (IP addresses actually in Iran being geolocated elsewhere).