Welcome to the CAPTCHA Monitoring project's wiki!
This wiki page contains the final report for the "Tor Project: Cloudflare CAPTCHA Monitoring" project for Google Summer of Code 2020. It is a broad overview of the work completed during the GSoC period; more detailed and up-to-date information is available on the home wiki page.
What is this project about?
The CAPTCHA Monitoring project aims to track how often webpages fronted by CDNs (e.g., Cloudflare, Akamai, Amazon CloudFront) return CAPTCHAs to Tor clients. It does this by fetching webpages via both Tor Browser and other mainstream web browsers and comparing the results. The tests are repeated periodically to reveal patterns over time. Collected metadata, metrics, and results are analyzed and displayed on a dashboard to understand how CDN providers manipulate internet traffic and affect people's access to the internet.
What work has been completed during the GSoC period?
I have been personally annoyed by receiving CAPTCHAs while using Tor, and going through the Tor Project's issue tickets, especially ticket #33010, showed that I wasn't alone in this. After years of complaints from users and research papers published on the topic, it was clear that a public database and data collection tool was needed to back up the claims and let CDN companies take action. So, the CAPTCHA Monitor was born. Since this project didn't exist before, I designed the whole system from scratch and built it during GSoC. The designs of other similar tools, such as OONI, Tor Metrics, and ExitMap, were influential in the decisions I made.
Next, I compiled a list of related tickets and comments from the Tor Project's bug tracking system (see the metrics to track section) to understand which metrics are valuable to collect and what the community wants to learn. These findings helped me tune my design further and build a roadmap.
Here is a high-level overview of the design I implemented:
There are five separate repositories dedicated to different parts of the project. I will explain the work completed for each repository separately, and here you can see the hierarchy of the repositories:
CAPTCHA Monitor
|-- Core
|   |-- Web
|   `-- HTTP Header Live
|-- API
`-- Dashboard
CAPTCHA Monitor Core
The core is responsible for performing the measurements, analyzing the results, and storing them. The compose submodule periodically fetches the list of URLs from the database and new exit relays from the consensus, then schedules measurement jobs from the URL and exit relay lists. Meanwhile, the run submodule runs multiple workers in parallel to process the measurement jobs: each worker lets Tor connect through the requested exit relay and fetches the URL with Tor Browser (or another requested web browser). Finally, the results are stored in the database. The code, issues, and documentation related to the CAPTCHA Monitor Core can be found in this repository.
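As a rough illustration of the compose/run split described above (not the project's actual code), the sketch below pairs every URL with every exit relay and browser to form jobs, then hands them to a pool of parallel workers. The names `Job`, `compose_jobs`, `run_jobs`, and `fake_measure` are hypothetical, and the measurement itself is stubbed out; a real worker would build a Tor circuit through the requested exit relay and drive a browser.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Job:
    """One measurement: fetch `url` through `exit_fingerprint` using `browser`."""
    url: str
    exit_fingerprint: str
    browser: str


def compose_jobs(urls, exit_fingerprints, browsers=("tor_browser",)):
    # Hypothetical "compose" step: one job per (URL, exit relay, browser) combination.
    return [Job(u, f, b) for u, f, b in product(urls, exit_fingerprints, browsers)]


def run_jobs(jobs, measure, max_workers=4):
    # Hypothetical "run" step: process jobs in parallel; results keep job order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(measure, jobs))


def fake_measure(job):
    # Stand-in for the real fetch; no network traffic is generated here.
    return {"job": job, "captcha": None}


jobs = compose_jobs(
    ["https://example.com"],
    ["A" * 40, "B" * 40],  # placeholder relay fingerprints
    browsers=("tor_browser", "firefox"),
)
results = run_jobs(jobs, fake_measure)
```

Keeping job composition separate from job execution means the scheduler never needs to know how a measurement is performed, only what should be measured.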
Additionally, the CAPTCHA Monitor Core relies on two other repositories to function. The first one is the HTTP Header Live repository. It contains a modified version of the HTTP Header Live web browser extension by Martin Antrag. HTTP Header Live is an extension that supports both Firefox and Chromium, and it records a copy of the HTTP headers while fetching pages. The modified version of the extension can interface with the CAPTCHA Monitor Core and export the headers in a particular JSON format.
The other one is the CAPTCHA Monitor Web repository. It contains the code for websites served by Cloudflare (see the domains used for testing section) and the Nginx configuration of the web server. These websites are used during the measurements to test specific properties of the Cloudflare blocking algorithm.
CAPTCHA Monitor API
The API is responsible for serving the collected data over a RESTful API. It both feeds the dashboard and provides open access to the collected data. The code, issues, and documentation related to the CAPTCHA Monitor API can be found in this repository. A live instance of the API can be accessed at api.captcha.wtf or capi4ljiudrzsnnlcnjror4ziizzbxevyngy5sbtxaato6v6gv5ck3qd.onion
CAPTCHA Monitor Dashboard
The dashboard is used for visualizing the analyzed data and for detecting anomalies in the trends. The code, issues, and documentation related to the CAPTCHA Monitor Dashboard can be found in this repository. A live instance of the dashboard can be accessed at dashboard.captcha.wtf or captchaufjq5m2i73up537pldaxnbp6rzcbdrzc7y5rlwtx3mwigznad.onion
So far, I have observed that using the Tor Browser Bundle out of the box, without changing its configuration, doesn't lead to a high CAPTCHA rate on Cloudflare-fronted websites (assuming the website owners don't explicitly block exit relays). That said, modifying the user agent, or making any other change that makes your browser's fingerprint deviate from that of a typical Tor Browser user, significantly increases the chance of getting CAPTCHAs. For example, using regular Firefox over Tor resulted in CAPTCHAs in ~90% of the measurements. I believe Cloudflare is very aggressive against "Firefox over Tor" users because many people, unfortunately, use Chromium/Firefox + Selenium + Tor to scrape web pages and bypass IP-based rate limits.
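To make the notion of a per-browser CAPTCHA rate concrete, here is a minimal sketch of how such a rate could be computed from stored measurement results. The function name `captcha_rates` and the record layout are assumptions for illustration, and the sample records are toy values, not data collected by the project.

```python
from collections import defaultdict


def captcha_rates(results):
    """Return the fraction of measurements that hit a CAPTCHA, per browser.

    `results` is an iterable of (browser, got_captcha) pairs (hypothetical layout).
    """
    totals = defaultdict(int)
    captchas = defaultdict(int)
    for browser, got_captcha in results:
        totals[browser] += 1
        captchas[browser] += int(got_captcha)
    return {browser: captchas[browser] / totals[browser] for browser in totals}


# Toy records for illustration only.
sample = [
    ("tor_browser", False),
    ("tor_browser", False),
    ("tor_browser", True),
    ("tor_browser", False),
    ("firefox_over_tor", True),
    ("firefox_over_tor", True),
    ("firefox_over_tor", False),
    ("firefox_over_tor", True),
]
rates = captcha_rates(sample)
```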
Additionally, I observed that the TLS fingerprint plays a significant role in whether someone gets a CAPTCHA. As a part of the project, I decided to capture the HTTP headers during measurements to understand how they affect the CAPTCHA rates. Initially, I was using a Python library called seleniumwire to capture the HTTP headers by intercepting the traffic between the Tor Browser and Tor. With this setup, I got CAPTCHAs in about 98% of the measurements. seleniumwire forwards the traffic transparently, but it has a different TLS fingerprint than Tor Browser. I concluded that the difference in the TLS fingerprints was triggering the MITM detection on the Cloudflare side, resulting in very high CAPTCHA rates.
Interestingly, when I tried the exact same Tor Browser and seleniumwire setup without Tor, I got practically no CAPTCHAs. I believe the MITM detection is more aggressive if the traffic is coming through an exit relay. So, I stopped using seleniumwire to capture headers, because it didn't reflect what a real human Tor Browser user usually experiences, and started using the HTTP Header Live web browser extension by Martin Antrag.
The weekly blog posts and the emails sent to the Tor mailing lists during the GSoC period can be found on the updates page. They are good for understanding how things evolved over time, especially the blog posts.
What would you do differently if you did it all again?
Before starting to work on this project, I was using Tor Browser as is, and I didn't have detailed technical knowledge of how the whole system works. I only had a rough idea of how Tor works, and my knowledge about the Tor Browser and Tor software grew pretty much organically as I asked questions on IRC, read the spec files, and coded. As you may have already guessed, I made a few bad decisions at the beginning of the project because of my initially limited knowledge of Tor's inner workings.
For example, I initially decided to use relays' OR addresses to index them in the database, because I thought all relays used their OR addresses as their exit addresses. Later, I learned that it is not a good idea to use OR addresses for indexing, and I switched to using relay fingerprints. I needed to edit or remove some parts of the codebase to make this change.
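The sketch below illustrates why a fingerprint makes a better key than an address. It is a hypothetical in-memory index, not the project's database schema: `Relay` and `RelayIndex` are illustrative names, the fingerprints are placeholders, and the IP addresses come from the reserved documentation ranges. The toy relays share one OR address, and one of them exits from a different address, so keying by OR address would merge them, while keying by fingerprint keeps them apart.

```python
from dataclasses import dataclass, field


@dataclass
class Relay:
    fingerprint: str  # 40 hex characters, unique per relay
    or_address: str
    exit_addresses: list = field(default_factory=list)


class RelayIndex:
    """Hypothetical index keyed by relay fingerprint instead of OR address."""

    def __init__(self):
        self._by_fingerprint = {}

    def upsert(self, relay):
        # Normalize case so lookups are insensitive to hex casing.
        self._by_fingerprint[relay.fingerprint.upper()] = relay

    def get(self, fingerprint):
        return self._by_fingerprint.get(fingerprint.upper())

    def __len__(self):
        return len(self._by_fingerprint)


index = RelayIndex()
# Two toy relays behind the same OR address; the second exits from another IP.
index.upsert(Relay("A" * 40, "192.0.2.10", ["192.0.2.10"]))
index.upsert(Relay("B" * 40, "192.0.2.10", ["198.51.100.7"]))
```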
Another example is my initial tool selection. I underestimated how much my project would grow and started with a modest SQLite database to store the data I collected. It was doing an OK job until I passed the 1 GB threshold and added the web API and parallel web page fetchers. My database needed to handle long simultaneous connections, which turned out to be very problematic with SQLite. I solved these issues by switching to PostgreSQL, but once again, I needed to edit the code to make this change. Luckily, I was expecting to make this upgrade at some point in the future (but not during the GSoC period), and I had built the database connection class to be modular. So, I only needed to edit that class, and the rest of the code worked just fine.
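A minimal sketch of that kind of modularity, assuming a small storage interface that the rest of the code depends on: `ResultStore`, `SQLiteResultStore`, and `record` are hypothetical names, and the table layout is invented for illustration. Swapping backends then means writing another class with the same methods (for example, one backed by PostgreSQL) without touching the callers.

```python
import sqlite3
from abc import ABC, abstractmethod


class ResultStore(ABC):
    """Hypothetical storage interface the rest of the code depends on."""

    @abstractmethod
    def insert_result(self, url, exit_fingerprint, is_captcha):
        ...

    @abstractmethod
    def count_results(self):
        ...


class SQLiteResultStore(ResultStore):
    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS results "
            "(url TEXT, exit_fingerprint TEXT, is_captcha INTEGER)"
        )

    def insert_result(self, url, exit_fingerprint, is_captcha):
        # The connection context manager commits on success, rolls back on error.
        with self._conn:
            self._conn.execute(
                "INSERT INTO results VALUES (?, ?, ?)",
                (url, exit_fingerprint, int(is_captcha)),
            )

    def count_results(self):
        return self._conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]


def record(store, url, exit_fingerprint, is_captcha):
    # Callers only see the ResultStore interface, never the backend.
    store.insert_result(url, exit_fingerprint, is_captcha)


store = SQLiteResultStore()
record(store, "https://example.com", "A" * 40, True)
```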
So, if I did it all again, I would read all of the spec files, learn more about how things work in detail, and better plan the project's future trajectory before starting to code. That said, I learn better when I see things in action, and I would probably end up making similar mistakes in other ways. I guess that is a part of the learning experience :)
What is left and next?
I pretty much finished everything I planned to work on (see roadmap). I'm still working on the second version of the dashboard (see #41). I was expecting to make minor revisions to the dashboard, but the feedback I received from the community showed that a fundamental change was necessary. Making that many changes to the dashboard wasn't a part of the anticipated roadmap for GSoC. I still wanted to finish these changes during the GSoC period, but once again, I underestimated their complexity. So, I plan to finish working on the v2 dashboard in September. Later, I will ask for feedback from the community and add new things based on that feedback.
Also, I need to finish documenting the CAPTCHA Monitor Core code. I use a lot of comments while I code, and the code is already documented in that sense. However, I need to finish writing the docstrings that explain the arguments and return values of each function.
Finally, I will work on the Tor Metrics (see #tpo/metrics/website/40002) integration. I'm committed to working on this project, and I'm not planning to stop until we achieve everything on the expected long-term impact agenda. New items will probably be added to the agenda as well.
I want to acknowledge my mentors Georg Koppen (@gk) and Roger Dingledine (@arma) for being very helpful, tirelessly answering my questions all the time, and guiding me to figure out pieces of this puzzle. I wouldn't have learned as much as I have without you. Thank you both!
And finally, a huge thanks to the folks who replied to my questions on IRC. Those replies were very important in helping me correct my errors and extend my knowledge.
Can I get involved in the development?
Yes, please! You can take a look at the contributing section, message me, or create an issue.
How can I contact you?
Thank you for your interest! Please take a look at the contact section.