Monitor cloudflare captcha rate: do a periodic onionperf-like query to a cloudflare-hosted static site

Trac:
Child Ticket(s): #34287 (moved), #34288 (moved), #34289 (moved), #34290 (moved), #34291 (moved), #34294 (moved), #34297 (moved)

added component::metrics/ideas gsoc-ideas network-health owner::metrics-team priority::medium severity::normal status::new type::task labels

This project would be a great gsoc idea.

Trac:
Cc: gk to gk, pili

Please consider setting up both an IPv4-only and an IPv6-only domain: one with only an A record and one with only an AAAA record, so that they can be tested individually. The same exit may be punished differently on each protocol, as if its two addresses belonged to different hosts.

The first-seen date of a fingerprint may also be important. It would let us check whether only "fresh" exit IPs can reach their destinations, and only for a short period of time, until they are burned with endless troll captchas.
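A rough sketch of that first-seen grouping is below. The record layout (`first_seen`, `measured_at`, `captcha`) and the 30-day cutoff for a "fresh" exit are assumptions made for illustration; nothing in this ticket fixes either.

```python
from datetime import datetime, timedelta

# Hypothetical cutoff: an exit counts as "fresh" for its first 30 days.
FRESH_WINDOW = timedelta(days=30)


def captcha_rate_by_age(measurements, fresh_window=FRESH_WINDOW):
    """Return the captcha rate for fresh and for older exits.

    Each measurement is a dict with hypothetical keys:
      first_seen  - datetime the exit fingerprint was first seen
      measured_at - datetime of the fetch
      captcha     - True if the fetch hit a captcha
    """
    counts = {"fresh": [0, 0], "older": [0, 0]}  # group -> [captchas, total]
    for m in measurements:
        age = m["measured_at"] - m["first_seen"]
        group = "fresh" if age <= fresh_window else "older"
        counts[group][0] += int(m["captcha"])
        counts[group][1] += 1
    # A group with no measurements has no defined rate.
    return {g: (c / n if n else None) for g, (c, n) in counts.items()}
```

If fresh exits really are treated better, the "fresh" rate should sit clearly below the "older" rate over a long enough window.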

We may have a nice static site already ready for mirror, Tor blog?

This will hopefully help to prove all the frustration and headache that cloudflaw is throwing at all Tor users on a daily basis.

For every UA that is not a browser, I guess fail rates are >90%. It's UA discrimination from my personal experience.

Cc'ing haxxpop too, so he can follow along. In an ideal world, Cloudflare would collaborate on making this external monitoring tool be useful for everybody involved. Maybe they even want to put an intern on it this summer. :)

Trac:
Cc: gk, pili to gk, pili, haxxpop

It's quite important to make the request headers look as much as possible like the ones from Tor Browser, because we sometimes consider traffic from the Tor network whose headers do not look like Tor Browser's to be malicious.

One catch is that Cloudflare currently gives alt-svc headers in response to fetches from Tor addresses. So that means we need a web client that can follow alt-svc headers -- maybe we need a full Selenium-like client?

Tor Browser does not upgrade immediately, so that alone is not much of a reason to use a real web browser. However, Cloudflare fingerprints the TLS handshake Client Hello (cipher suites and groups in TLS 1.3) to tell the real Tor Browser from a spoofed one. To pass, you must build curl against NSS and set the correct headers and cipher suites.

Cloudflare also has different levels of protection, and some grandfathered protection levels have no Tor Browser whitelisting. We should test them all.

There is also a case where only subresource requests trigger a captcha, which is then never displayed to the user. This makes sites break with no way to resolve it, because the user cannot see the captcha!

Example site https://kiwiirc.com/nextclient/

Open the network panel in dev tools and visit the link. You will see that the JavaScript resources return 403 Forbidden and require a captcha, but this is not displayed to the user. If you open the 403 URLs in the URL bar, they work without a problem. The difference is the Accept header.
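To make the comparison concrete, here is a minimal sketch of the two requests involved. The Accept values are illustrative guesses at what a browser sends for a script subresource versus a top-level navigation; the exact strings Tor Browser uses, and the resource URL, are assumptions.

```python
import urllib.request

# Illustrative Accept values only; the exact strings Tor Browser sends may differ.
ACCEPT_SUBRESOURCE = "*/*"
ACCEPT_NAVIGATION = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"


def build_request(url, accept):
    """Build (but do not send) a GET request with the given Accept header."""
    return urllib.request.Request(url, headers={"Accept": accept}, method="GET")


# Hypothetical script URL standing in for one of the 403'd resources.
url = "https://example.com/static/app.js"
as_subresource = build_request(url, ACCEPT_SUBRESOURCE)
as_navigation = build_request(url, ACCEPT_NAVIGATION)
```

Sending both requests through the same exit and comparing the status codes would show whether the Accept header alone decides who gets challenged.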

The only thing worse than reCRAPtcha is invisible reCRAPtcha. At least a visible captcha I can solve and then access the site. An invisible captcha is just access denied without telling you.

What is the point of a captcha if it cannot be seen, cloudflare?

https://github.com/NullHypothesis/exitmap/blob/master/src/modules/cloudflared.py

This better fits into our Ideas subcomponent.

Trac:
Component: Metrics/Exit Scanner to Metrics/Ideas

Trac:
Keywords: network-health deleted, network-health gsoc-ideas added

If the Tor Project can provide a list of sites which are blocking Tor, that would be useful.

example link. Green checkmark: Tor passed, Red: Tor browser simulation denied. Also this link.

Replying to cypherpunks:

It's UA discrimination from my personal experience.

This is true. See "Browser vendor discrimination". (It's not as secure as Tor Browser, but there are people who use Chromium/Firefox over the Tor daemon.)

Replying to arma:

[snip]

One catch is that Cloudflare currently gives alt-svc headers in response to fetches from Tor addresses. So that means we need a web client that can follow alt-svc headers -- maybe we need a full Selenium-like client?

The alt-svc is not kicking in with the first load. So, if we use a really simple static page (that is with nothing dynamic and no sub resources being requested subsequently) we should not hit that complicating factor.

That said, using Tor Browser for the case where we actually want to find out the Tor Browser experience seems like a thing we should investigate, if only for the reason mentioned in comment:4. There is tor-browser-selenium and various forks that should do the trick in combination with stem.

(fix typo)

Trac:
Description: changed (typo fix, "weighted faction" to "weighted fashion"). The updated text:

We should track the rate that cloudflare gives captchas to Tor users over time.

My suggested way of doing that tracking is to sign up a very simple static webpage to be fronted by cloudflare, and then fetch it via Tor over time, and record and graph the rates of getting a captcha vs getting the real page.

The reason for the "simple static page" is to make it really easy to distinguish whether we're getting hit with a captcha. The "distinguishing one dynamic web page from another" challenge makes exitmap tricky in the general case, but we can remove that variable here.
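With a static page under our control, the check can be as small as the sketch below. The expected body string, the `classify_response` name, and the rule that anything other than our own page counts as a challenge are assumptions for illustration.

```python
EXPECTED_BODY = "Hello world!"  # hypothetical content of the static test page


def classify_response(status, body, expected=EXPECTED_BODY):
    """Classify one fetch of the static page.

    Returns "error" for transport-level failures (status is None), "ok" if we
    got the page we published, and "challenged" for anything else, on the
    assumption that any other response came from the CDN and not our origin.
    """
    if status is None:
        return "error"
    if status == 200 and expected in body:
        return "ok"
    return "challenged"
```

Keeping "error" separate from "challenged" matters here, since circuit failures and timeouts over Tor would otherwise inflate the captcha rate.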

One catch is that Cloudflare currently gives alt-svc headers in response to fetches from Tor addresses. So that means we need a web client that can follow alt-svc headers -- maybe we need a full Selenium-like client?

Once we get the infrastructure set up, we would be smart to run a second one which is just wget or curl or lynx or something, i.e. which doesn't behave like Tor Browser, in order to be able to track the difference between how Cloudflare responds to Tor Browser vs other browsers.

I imagine that Cloudflare should be internally tracking how they're handling Tor requests, but having a public tracker (a) gives the data to everybody, and (b) helps Cloudflare have a second opinion in case their internal data diverges from the public version.

The Berkeley ICSI group did research that included this sort of check: https://www.freehaven.net/anonbib/#differential-ndss2016 https://www.freehaven.net/anonbib/#exit-blocking2017 but what I have in mind here is essentially a simpler subset of this research, skipping the complicated part of "how do you tell what kind of response you got" and with an emphasis on automation and consistency.

There are two interesting metrics to track over time: one is the fraction of exit relays that are getting hit with captchas, and the other is the chance that a Tor client, choosing an exit relay in the normal weighted fashion, will get hit by a captcha.
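The two metrics differ only in how each exit is weighted, which a short sketch can show. The input shapes are assumptions: `results` maps an exit fingerprint to whether it was challenged, and `exit_probability` maps a fingerprint to the chance a client picks it, however one chooses to derive that (for example from consensus weights).

```python
def captcha_metrics(results, exit_probability):
    """Compute both captcha metrics for one measurement round.

    results          - {fingerprint: True if that exit got a captcha}
    exit_probability - {fingerprint: probability a client selects that exit}

    Returns (fraction_of_exits, client_probability), or (None, None) when no
    measured exit has a usable selection probability.
    """
    measured = [fp for fp in results if fp in exit_probability]
    total_weight = sum(exit_probability[fp] for fp in measured)
    if not measured or total_weight == 0:
        return None, None
    # Metric 1: every exit counts equally.
    fraction = sum(1 for fp in measured if results[fp]) / len(measured)
    # Metric 2: exits count in proportion to how often clients pick them,
    # renormalised over the exits we actually measured.
    weighted = sum(exit_probability[fp] for fp in measured if results[fp]) / total_weight
    return fraction, weighted
```

A single large exit that is always challenged barely moves the first number but can dominate the second, which is why both are worth graphing.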

Then there are other interesting patterns to look for, e.g. "are certain IP addresses punished consistently and others never punished, or is whether you get a captcha much more probabilistic and transient?" And does that pattern change over time?
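One simple way to start on that question is to look at the per-IP captcha rate over a window, as sketched below. The thresholds and label names are hypothetical, and a real analysis would also want a minimum number of observations per IP.

```python
from collections import defaultdict


def classify_exits(observations, low=0.1, high=0.9):
    """Label each exit IP by how consistently it is challenged.

    observations - iterable of (exit_ip, got_captcha) pairs collected over time
    low, high    - hypothetical thresholds on the per-IP captcha rate
    """
    tallies = defaultdict(lambda: [0, 0])  # ip -> [captchas, total]
    for ip, got_captcha in observations:
        tallies[ip][0] += int(got_captcha)
        tallies[ip][1] += 1
    labels = {}
    for ip, (captchas, total) in tallies.items():
        rate = captchas / total
        if rate >= high:
            labels[ip] = "consistently punished"
        elif rate <= low:
            labels[ip] = "rarely punished"
        else:
            labels[ip] = "transient"
    return labels
```

Rerunning this over successive windows and counting how many IPs change label would address the "does that pattern change over time" part.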

WARNING: I changed the way these two domains are registered on Cloudflare. All pages and subdomains still exist in the way explained here. That being said, now captcha.wtf has only IPv4 entries and exit11.online has IPv6 entries as suggested by everyone. Previously, I had trouble with getting an IPv6 address to my server.

Please take a look at this wiki page for the most up to date information.

Rest of the original post:

I wanted to conduct a few simple experiments on this issue. I will start by explaining my setup and continue with the experiments themselves.

Domain Setup

I registered two domains (captcha.wtf and exit11.online) with IPv4 records on Cloudflare. After playing with Cloudflare settings, I understood that domain owners have an important role in the way Cloudflare blocks Tor users.

A new free Cloudflare account comes with a default security level (like the security levels in the Tor browser and as comment:5 mentioned), and the default security level doesn't explicitly block Tor users. I am not saying Cloudflare is innocent, but they don't mention a possible Tor user blocking at this security level. However, Tor shows up as a country on the Cloudflare firewall settings, and it is possible to block Tor users based on this firewall rule. I think they have a list of Tor exit node IPs, and they use this list to perform the filtering. They "offer" JS and Captcha challenges in addition to simple blocking, as shown in the image below:

![tor-firewall-rules.png](https://bottomless-pit.barkin.io/tor-firewall-rules.png)

I think that's why some Tor users face more captcha challenges at higher Tor browser security levels. JavaScript is blocked at higher security levels, and they can't pass the Cloudflare JS challenges.

Also, if a firewall rule related to Tor is set, Cloudflare applies that rule (for example, the never-ending captcha challenge) all the time, even if the user has somehow managed to pass the challenge 5 seconds ago. I think that is the part all of us hate; it just creates an endless loop. A sample Cloudflare firewall record below shows that the same IP address is challenged over and over again, even after successfully passing the captcha challenge.

![tor-firewall-1.png](https://bottomless-pit.barkin.io/tor-firewall-1.png)

exit11.online has the default Cloudflare configuration without any additional firewall or protection. I am guessing that this would be the case with most of the average Cloudflare users. I also registered the bypass.exit11.online subdomain, which bypasses the Cloudflare proxy and only utilizes Cloudflare as a DNS hosting service and CDN.

![tor-cloudflare-exit11.png](https://bottomless-pit.barkin.io/tor-cloudflare-exit11.png)

captcha.wtf has the default Cloudflare configuration with the additional firewall configuration for blocking Tor users, as I have mentioned previously. I registered this second domain to see the difference between using the default Cloudflare settings and adding additional firewall rules. I also registered the bypass.captcha.wtf subdomain, which bypasses the Cloudflare proxy and only utilizes Cloudflare as a DNS hosting service and CDN.

![tor-cloudflare-wtf.png](https://bottomless-pit.barkin.io/tor-cloudflare-wtf.png)

![tor-cloudflare-wtf-firewall.png](https://bottomless-pit.barkin.io/tor-cloudflare-wtf-firewall.png)

Both of these domains have a very simple static "Hello world!" page at /index.html, and there is a more complicated page at /complex.html that loads resources from different locations. Additionally, captcha.wtf & exit11.online have SSL certificates issued by Cloudflare and bypass.captcha.wtf & bypass.exit11.online have SSL certificates issued by Let's Encrypt. I thought that these might have an effect on the way Cloudflare behaves.

Experimenting

Later, I used the Python script mentioned in comment:7 (it uses httplib) and the tor-browser-selenium mentioned in comment:12 to conduct a few simple experiments. I wrote another script to fetch different domain combinations via tor-browser-selenium and Python's httplib. For example, fetching bypass.exit11.online, exit11.online, exit11.online/complex.html, and bypass.exit11.online/complex.html via both tor-browser-selenium and Python's httplib.
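The combinations described above amount to a small test matrix, which might be generated as in the sketch below. This is not the script from the repository; the list names, the client labels, and the use of `https://` URLs are assumptions, while the domains and paths are the ones named in this comment.

```python
from itertools import product

DOMAINS = ["exit11.online", "captcha.wtf"]
PREFIXES = ["", "bypass."]  # proxied through Cloudflare vs. proxy bypassed
PATHS = ["/index.html", "/complex.html"]
CLIENTS = ["tor-browser-selenium", "httplib"]  # labels only, not real client objects


def build_test_matrix():
    """Return every (client, url) pair to fetch in one measurement round."""
    return [
        (client, f"https://{prefix}{domain}{path}")
        for client, prefix, domain, path in product(CLIENTS, PREFIXES, DOMAINS, PATHS)
    ]
```

Generating the matrix from lists keeps the rounds symmetric, so every URL is always fetched by both clients.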

Results

After fetching each combination about 100 times at one-minute intervals, the domain with the default configuration (exit11.online) was not blocked a single time via either Tor or httplib. However, the domain with the additional firewall configuration (captcha.wtf) was blocked every single time when fetched via Tor. Of course, both of the bypass subdomains were fine since the Cloudflare proxy was disabled, but I wanted to test it anyway.

Possible Conclusions

I'm sure my simple tests are not nearly enough to draw a meaningful conclusion, but these results make me question the role of domain owners in this endless captcha problem. The domain with the default Cloudflare configuration didn't block Tor users, but the domain with the extra firewall configuration set by the domain owner blocked Tor users all the time. However, again, this is an observation based on my very limited experiments.

I want to conduct more advanced experiments based on your feedback to address the metrics mentioned in the original ticket and find possible patterns in the recorded data.

Please feel free to use both of these domains for further testing.

Some ideas worth keeping in mind, which irl brought up the other day:

- Is there an IPv4/IPv6 difference?
- Does it matter which day of the week or time of the day sites are getting visited?
- Does the size of the exit relay play a role (larger might carry "more" abusive traffic)?
- If we check Tor Browser, we should have a Firefox control group (maybe with FPI and RFP on) or another tool using just tor (curl/Firefox).

I made additions to the repository I mentioned in comment:14 and deployed the code to a cloud server, specifically the automated_fetcher_influxdb example.

Now, the server is fetching captcha.wtf & exit11.online pages and their combinations with & without the Tor browser at 15-minute intervals. The full list of URLs tested is here. Later, the results are sent to an InfluxDB database.
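For anyone wanting to feed similar data into their own InfluxDB, a result can be serialised to line protocol by hand, as in this sketch. The measurement, tag, and field names are hypothetical and are not taken from the automated_fetcher_influxdb code.

```python
def escape_tag(value):
    """Escape commas, equals signs, and spaces, as line protocol requires in tags."""
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def format_field(value):
    """Render a field value using line protocol type syntax."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line: measurement,tag=value field=value timestamp."""
    tag_part = "".join(
        f",{escape_tag(k)}={escape_tag(v)}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(f"{k}={format_field(v)}" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"
```

Putting the URL and client in tags and the outcome in fields means the dashboard can group by URL or client without scanning field values.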

I created a public Grafana dashboard at dashboard.captcha.wtf to quickly visualize the collected data. You can visit the dashboard to see the data collected so far. I will add more panels and analysis to the dashboard as I implement more metrics to track.

Note: captcha.wtf & exit11.online websites and the automated_fetcher_influxdb code are not hosted on the same server. They all have different IP addresses if anyone is wondering.

I wanted to share this lovely(!) patent, just in case anyone missed it:

Blocking via an unsolvable CAPTCHA https://patents.google.com/patent/US9407661

Replying to woswos:

Blocking via an unsolvable CAPTCHA https://patents.google.com/patent/US9407661

Yes, they own a so-called Troll Captcha patent, and reCAPTCHA effectively presents you with this type of unsolvable captcha. Or, when connecting through an exit node, just the message of "generate an unsolvable challenge-response test based on identifying the request as being associated with the malicious activity."

Meanwhile, "associated with the malicious activity" can already mean nothing more than the high number of requests that any node is processing.

But did you notice cloudflare seems to have changed captcha provider from recaptcha to ?
