The CAPTCHA Monitoring project aims to track how often webpages fronted by CDNs (for example, Cloudflare, Akamai, and Amazon CloudFront) return CAPTCHAs to Tor clients. It does this by fetching webpages both via Tor and via mainstream web browsers that do not use Tor, and then comparing the results. The tests are repeated periodically to find patterns over time. The collected metadata, metrics, and results are analyzed and displayed on a dashboard to understand how CDN providers manipulate internet traffic and affect people's access to the internet.
Interesting places to visit
- dashboard.captcha.wtf or captchaufjq5m2i73up537pldaxnbp6rzcbdrzc7y5rlwtx3mwigznad.onion
- api.captcha.wtf or capi4ljiudrzsnnlcnjror4ziizzbxevyngy5sbtxaato6v6gv5ck3qd.onion
Code
The codebase consists of four separate repositories that are dedicated to the four different components of the project:
- The code for the core of the project that does measurements is located in this GitLab repository
- The code for the dashboard for visualizing the results is located in this GitLab repository
- The code for the API is located in this GitLab repository
- The code for the websites fronted by Cloudflare is located in this GitLab repository
Documentation
- Core CAPTCHA Monitor code documentation -> Read the Docs page [Not updated at the moment]
- Interactive API documentation -> api.captcha.wtf
Dataset
You can view various visualizations of the collected data on the dashboard. If you prefer to access the raw data and conduct your own research, you can use the API to fetch the data.
If you want a copy of the whole database, I would be very happy to share it; please contact me.
Detailed description
(Keep in mind that the project initially focused only on Cloudflare and later expanded to track other CDN providers as well.) By design, Cloudflare alters the traffic between web servers and internet users. Cloudflare modifies internet traffic to protect the web servers it fronts from attacks by users with malicious intentions. Even though this looks like a good-faith practice to protect servers on the surface, it does more harm than good for millions of users. Cloudflare decides whether to block users based on multiple factors such as the visitor's IP address, the resources requested, the request payload and frequency, and customer-defined firewall rules (Source). They don't share the specifics of their decision-making mechanism since it keeps changing over time, and it is not open-source. However, this doesn't stop us from experimenting with the algorithm and understanding how it decides whether to block users.
Cloudflare mentions that IP address based rules have the highest priority, followed by Firewall Rules, Zone (URL) Lockdown, User Agent Blocking, and the Web Application Firewall (Source). Thus, Cloudflare clearly states in its documentation that it considers the user's IP address and the web browser's User Agent when deciding whether to block a user. Unfortunately, Cloudflare's algorithms raise red flags when these two parameters (IP address and user agent) match those of a typical Tor user. This is easy for Cloudflare to do because Tor Browser follows a one-fingerprint-for-all philosophy, and the list of Tor exit nodes is publicly available. The Cloudflare CTO himself explains in trac ticket:18361#comment:23 that they fetch the list of Tor exit nodes and assign a reputation to the nodes in order to block certain users.
Currently, there are a few research projects (like Khattak et al. and Singh et al.) on Tor user blocking practices, but to the best of my knowledge, there is no public tool or database that regularly collects data on Cloudflare's Tor user blocking practices. Thus, this project aims to develop tools to monitor this issue and create a database for public use. Eventually, once enough data has accumulated, this tool is intended to serve as a data source for the Tor Metrics project. It was also observed that many users struggle to reliably reproduce Cloudflare's behavior when reporting it in their tickets, since too many variables are involved in the process. Thus, this project can be used as a standardized toolset to reproduce Cloudflare's behavior, since many of the variables are controlled within the project. The collected data might serve as a reference point for the measurements done by individual users.
Expected long-term impact
- Creating an up-to-date and reliable data source for further research on the topic
- Integrating the collected data into Tor Metrics
- Relaxing Cloudflare's CAPTCHA policies
- Helping Tor users browse the internet without sacrificing privacy or being discriminated against
Approach
- Setting up Cloudflare fronted websites that simulate the various configurations available to Cloudflare users (take a look at the domains used for testing section)
- Periodically fetching these websites via Tor and via mainstream web browsers that are not using Tor
- Recording whether a CAPTCHA is returned during each fetch, along with other predefined metrics
- Visualizing the results in a dashboard (dashboard.captcha.wtf) and analyzing the collected data
- Tracking and making the dataset & the results publicly available (api.captcha.wtf)
[Diagram: overview of the approach described above]
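The fetch-and-compare steps listed above can be sketched in Python, the language of the core component. This is a minimal illustration under stated assumptions, not the project's actual code: the names `looks_like_captcha`, `Measurement`, and `measure` are hypothetical, the marker strings are an assumed heuristic, and the two fetchers are passed in as plain callables so that any Tor or non-Tor browser driver could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed heuristic: a challenge page mentions both of these strings.
# The real detection logic in the project may differ.
CAPTCHA_MARKERS = ("cloudflare", "captcha")


def looks_like_captcha(html: str) -> bool:
    """Return True if the page body contains every assumed marker string."""
    body = html.lower()
    return all(marker in body for marker in CAPTCHA_MARKERS)


@dataclass
class Measurement:
    """Result of fetching one URL both via Tor and without Tor."""
    url: str
    captcha_via_tor: bool
    captcha_direct: bool

    @property
    def tor_only_captcha(self) -> bool:
        # The interesting case: only the Tor fetch was challenged.
        return self.captcha_via_tor and not self.captcha_direct


def measure(
    url: str,
    fetch_via_tor: Callable[[str], str],
    fetch_direct: Callable[[str], str],
) -> Measurement:
    """Fetch the same URL through both paths and compare the results."""
    return Measurement(
        url=url,
        captcha_via_tor=looks_like_captcha(fetch_via_tor(url)),
        captcha_direct=looks_like_captcha(fetch_direct(url)),
    )
```

Keeping the fetchers as parameters means the comparison logic can be exercised with canned HTML, while the periodic scheduling and the storage of results stay outside this sketch.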
Metrics to track
Here are some of the questions that the project will try to answer by tracking the related metrics. Some of them come from questions asked and issues reported by the community.
- Does Cloudflare treat IPv4 and IPv6 addresses differently? [ticket:33010#comment:2]
- How do the HTTP request headers affect Cloudflare's decision-making mechanism? [ticket:33010#comment:4]
- Is there a difference between using the actual Tor Browser itself and tor-browser-selenium in terms of the HTTP headers?
- How does Cloudflare react differently if the browser doesn't support alt-svc headers? [ticket:32915]
- How do different browsers with different User Agents get affected? [ticket:33010#comment:2], [ticket:32924], [ticket:31404]
- Is there a difference between using a web browser or fetching web pages via cURL or other HTTP libraries?
- What about the different versions of the Tor Browser? Does Cloudflare behave differently to different versions of the same browser?
- What about the different security levels of Tor Browser?
- How does Cloudflare react to browsers with and without JavaScript enabled? [ticket:31404]
- What kind of per browser session tracking and blocking is actually happening? [ticket:18361]
- How does having pre-existing cookies for other websites affect Cloudflare's behavior? [ticket:18361#comment:7], [ticket:23840#comment:26]
- How do different security levels of Cloudflare affect the blocking mechanism? [ticket:33010#comment:5]
- Do some of the Cloudflare security levels block users immediately without presenting a CAPTCHA challenge at all?
- How does time affect Cloudflare's blocking mechanism? Does the day of the week or the time of the day matter? [ticket:33010#comment:15]
- How often does Cloudflare's blocking mechanism change/update itself?
- How do specific exit nodes get affected by Cloudflare's blocking practices?
- Does the size/age/location of the exit node play a role? [ticket:33010#comment:15]
- Is it always the same Tor exit nodes that get blocked?
- How well does Cloudflare keep track of the new or old Tor exit nodes?
- How frequently does Cloudflare update its Tor exit node list?
- What fraction of the Tor exit nodes get affected by Cloudflare's blocking practices? [ticket:33010], [ticket:23840#comment:22]
- What is the chance of a Tor client getting affected by Cloudflare's blocking practices when choosing a Tor exit node? [ticket:33010]
- Is there a difference between websites that load resources from third-party sources and websites that serve all resources from the origin server? [ticket:33010#comment:6]
- How do users of websites get affected if the main website is not fronted by Cloudflare, but some of the resources are fetched from a Cloudflare fronted web server? [ticket:33010#comment:6], [ticket:15450]
- Is there a difference if the origin server has an SSL certificate or not?
- Does the blocking change if the SSL certificate is issued by Cloudflare or by another entity?
- If browsers that should not face a CAPTCHA do face one, why does this happen?
- How do the observed patterns in the results change over time? [ticket:33010]
- Is getting a CAPTCHA largely probabilistic and transient? [ticket:33010]
- What is the chance that a Tor client, choosing an exit relay in the normal weighted fashion, will get hit by a CAPTCHA? [ticket:33010]
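Two of the questions above are easy to confuse: the fraction of exit nodes that are affected, and the chance that a client picking an exit by weight is affected. The sketch below shows the arithmetic for both. It is an illustration only: the function names are hypothetical, the numbers in the usage note are made up, and it assumes each exit has a known selection weight and an observed CAPTCHA rate between 0 and 1.

```python
from typing import Sequence, Tuple

# Each entry is (selection_weight, observed_captcha_rate) for one exit relay.
ExitStats = Sequence[Tuple[float, float]]


def fraction_of_exits_affected(exits: ExitStats) -> float:
    """Unweighted share of exits that saw at least one CAPTCHA."""
    if not exits:
        raise ValueError("need at least one exit")
    affected = sum(1 for _, rate in exits if rate > 0)
    return affected / len(exits)


def weighted_captcha_probability(exits: ExitStats) -> float:
    """Chance of a CAPTCHA when exits are picked in proportion to weight.

    Computes sum(w_i * p_i) / sum(w_i), where w_i is the selection weight
    and p_i is the observed CAPTCHA rate of exit i.
    """
    total_weight = sum(weight for weight, _ in exits)
    if total_weight <= 0:
        raise ValueError("total selection weight must be positive")
    return sum(weight * rate for weight, rate in exits) / total_weight
```

For example, with one heavily weighted exit that always gets a CAPTCHA and one lightly weighted exit that never does, `[(3.0, 1.0), (1.0, 0.0)]`, half of the exits are affected but a client hits a CAPTCHA 75% of the time.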
Related tickets
The original ticket that initiated this project can be found here: #33010
- #18361 - Issues with corporate censorship and mass surveillance
- #23840 - Google's reCAPTCHA fails 100%
- #24351 - Block Global Active Adversary Cloudflare; The Great Cloudwall [The original page was deleted by cyberpunks, a mirror can be found here https://codeberg.org/crimeflare/cloudflare-tor]
- #31404 - Unsolvable reCAPTCHAs
- #32915 - Cloudflare alt-svc failures cause spurious "DNS resolution error" in Tor Browser
Roadmap
Please consider taking a look at the CAPTCHA Monitor Project's Kanban board for the most up-to-date information.
- Create Cloudflare fronted websites
- IPv4 and IPv6 only domains (as suggested by ticket:33010#comment:2)
- Build a simple website fetcher to collect data
- Check for the existence of the "Cloudflare" string in the returned website (as suggested by ticket:33010#comment:25)
- Create a simple dashboard for displaying collected data
- Make the dataset downloadable
- The dataset can be downloaded through the API
- Have a working minimum viable product
- Integrate Tor Stem
- Integrate more web browsers
- Integrate older versions of the web browsers as well
- Integrate the Cloudflare API to avoid changing the Cloudflare settings of the websites manually
- Optimize the data storage format
- Write tests
- Enhance the available visualizations on the dashboard
- Split the codebase into more modular pieces that can be chained, create a pipeline
- CAPTCHA Monitor core
- A tool for organizing/compacting the data
- API
- Dashboard
- Brainstorm about new metrics to collect
- Find more third-party websites to track
- Complete dashboard v2
- Write the code for the backend
- Implement the new dashboard UI with the graphs
- Update tests
- Increase the number of completed measurements per hour
- Submit a report to the Tor Research Safety Board
- Brainstorm the integration with OONI people
- Brainstorm the integration with Tor Metrics people
- Create an API for people to fetch data easily (api.captcha.wtf)
- Create a new endpoint to the API for performing measurements on the user-provided websites
Domains used for testing
Down the road, I ended up getting more Cloudflare fronted domains for testing purposes. Feel free to use any of these domains for experimenting. They all point to the same static resources. You can take a look at this repository to learn more about the exact configurations.
Note: The domains below are listed for transparency and CAPTCHA Monitor uses many more domains for the measurements. You can find the complete list here.
Here is the complete list of domains owned by the CAPTCHA Monitoring project and used for testing:
- captcha.wtf
  - IPv4 only domain, no additional Cloudflare firewall rules
- yearlight.buzz
  - IPv4 only domain, Cloudflare firewall is set to present "JS Challenge" for traffic originating from the Tor network
- bottomlesspit.xyz
  - IPv4 only domain, Cloudflare firewall is set to present "CAPTCHA Challenge" for traffic originating from the Tor network
- broccolipizza.monster
  - IPv4 only domain, Cloudflare firewall is set to block all traffic originating from the Tor network
- exit11.online
  - IPv6 only domain, no additional Cloudflare firewall rules
- icanhazcaptcha.xyz
  - IPv6 only domain, Cloudflare firewall is set to present "CAPTCHA Challenge" for traffic originating from the Tor network
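For scripted experiments, the configurations listed above can be captured in a small lookup table. The sketch below is only one possible encoding: the `MonitoredDomain` class, the `domains_for` helper, and the rule labels (`"none"`, `"js_challenge"`, `"captcha_challenge"`, `"block"`) are hypothetical names chosen for illustration, while the domains, IP versions, and firewall behaviors are taken from the list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MonitoredDomain:
    name: str
    ip_version: int  # 4 or 6
    tor_rule: str    # hypothetical label for the firewall rule applied to Tor traffic


# Mirrors the list of project-owned test domains above.
TEST_DOMAINS = [
    MonitoredDomain("captcha.wtf", 4, "none"),
    MonitoredDomain("yearlight.buzz", 4, "js_challenge"),
    MonitoredDomain("bottomlesspit.xyz", 4, "captcha_challenge"),
    MonitoredDomain("broccolipizza.monster", 4, "block"),
    MonitoredDomain("exit11.online", 6, "none"),
    MonitoredDomain("icanhazcaptcha.xyz", 6, "captcha_challenge"),
]


def domains_for(
    ip_version: Optional[int] = None,
    tor_rule: Optional[str] = None,
) -> List[MonitoredDomain]:
    """Filter the test domains by IP version and/or Tor firewall rule."""
    return [
        d for d in TEST_DOMAINS
        if (ip_version is None or d.ip_version == ip_version)
        and (tor_rule is None or d.tor_rule == tor_rule)
    ]
```

A table like this makes it straightforward to pick a baseline domain (no additional rules) and a domain with a known rule for the same IP version, and to compare what is observed against what the configuration should produce.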
Development
Georg Koppen (@gk) & Roger Dingledine (@arma) are the mentors of this project, and I (woswos) am currently its only developer. I started developing this project as a part of the Google Summer of Code program, and you can take a look at the GSoC 2020 wiki page to see the work completed during the GSoC period.
Contact
If you have any questions, concerns, feedback, etc. you can reach me on the #tor-dev or #tor-project channels on OFTC IRC. My IRC handle is woswos, and if you need help with connecting to IRC, you can follow this tutorial.
You can also email me at <barkin(at)nyu(dot)edu>.
Reporting bugs
Please use the respective repository mentioned in the code section for reporting bugs. If you are not sure about which one to choose, you can use the current repository.
Contributing
The CAPTCHA Monitoring project consists of different components that use different programming languages and frameworks:
- Core of CAPTCHA Monitor -> Python
- API -> JavaScript and Express.js
- Dashboard -> Bootstrap, Pug, JavaScript, and Express.js
You are welcome to contribute to any of them, and thank you so much for doing so!