|
|
# WARNING: This page is a working draft
|
|
|
|
|
|
# Welcome to the CAPTCHA Monitoring project's wiki!
|
|
|
This wiki page contains the final report for the "Tor Project: Cloudflare CAPTCHA Monitoring" project for Google Summer of Code 2020. It is a broad overview of the work completed during the GSoC period; more detailed and up-to-date information is available on the [home wiki page](home).
|
|
|
|
... | ... | @@ -14,7 +12,7 @@ The **CAPTCHA Monitoring** project aims to track how often CDN (for ex. Cloudfla |
|
|
### Background
|
|
|
I have been personally annoyed by receiving CAPTCHAs while using Tor, and going through the Tor Project's issue tickets showed that I wasn't alone in this, especially ticket [#33010](https://gitlab.torproject.org/tpo/metrics/ideas/-/issues/33010). After years of complaints from users and research papers published on the topic, it was clear that a public database & data collection tool was needed to back up the claims and let CDN companies take action. So, the CAPTCHA Monitor was born. Since this project didn't exist before, I designed the whole system from scratch and built it during GSoC. The designs of other similar tools, such as [OONI](https://ooni.org/), [Tor Metrics](https://metrics.torproject.org/), and [ExitMap](https://github.com/NullHypothesis/exitmap/), were influential in the decisions I made.
|
|
|
Next, I compiled a list of related tickets & comments from Tor Project's bug tracking system (see [metrics to track section](home#metrics-to-track)) to understand which metrics are valuable to collect and what the community wants to learn. These findings helped me to tune my design further and build a [roadmap](home#roadmap).
|
|
|
|
|
|
Here is a high-level overview of the design I implemented:
|
|
|
*(Mermaid diagram of the CAPTCHA Monitor design; its body is not included in this view.)*
|
|
### CAPTCHA Monitor Core
|
|
|
The core is responsible for performing the measurements, analyzing the results, and storing them. The `compose` submodule periodically fetches the list of URLs from the database and new exit relays from the consensus, and then schedules measurement jobs from the URL and exit relay lists. Meanwhile, the `run` submodule runs multiple workers in parallel to process the measurement jobs: each worker lets Tor connect to the requested exit relay and fetches the URL via Tor Browser (or another requested web browser). Finally, the results are stored in the database. The code, issues, and documentation related to the CAPTCHA Monitor Core can be found in [this repository](https://gitlab.torproject.org/woswos/CAPTCHA-Monitor).
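To make the compose/run split concrete, here is a minimal sketch of the idea, not the project's actual code. The `Job` class, `compose_jobs`, `run_jobs`, and `fake_fetch` are hypothetical names, the relay identifiers are placeholders, and pairing every URL with every exit relay is an assumption made for illustration. The fetch step is a stand-in that performs no network access.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Job:
    """One measurement job (hypothetical structure)."""
    url: str
    exit_relay: str  # placeholder identifier, not a real fingerprint
    browser: str = "tor_browser"


def compose_jobs(urls, exit_relays, browser="tor_browser"):
    """Pair every URL with every exit relay (an assumption for this sketch)."""
    return [Job(url, relay, browser) for url, relay in product(urls, exit_relays)]


def fake_fetch(job):
    """Stand-in for a real measurement; no Tor or browser is started here."""
    return {
        "url": job.url,
        "exit_relay": job.exit_relay,
        "browser": job.browser,
        "captcha": None,  # unknown, because nothing was actually fetched
    }


def run_jobs(jobs, fetch=fake_fetch, workers=4):
    """Process jobs with parallel workers and return results in job order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, jobs))


jobs = compose_jobs(
    ["https://example.com/", "https://example.org/"],
    ["RELAY_A", "RELAY_B", "RELAY_C"],
)
results = run_jobs(jobs)
```

In a real deployment, the results would be written to the database instead of being kept in a list, and `fetch` would drive a browser through the chosen exit relay.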
|
|
|
Additionally, the CAPTCHA Monitor Core relies on two other repositories to function. The first is the [HTTP Header Live repository](https://gitlab.torproject.org/woswos/HTTP-Header-Live). It contains a modified version of the [HTTP Header Live web browser extension by Martin Antrag](https://github.com/Nitrama/HTTP-Header-Live). HTTP Header Live is an extension that supports both Firefox & Chromium and records a copy of the HTTP headers while fetching pages. The modified version of the extension can interface with the CAPTCHA Monitor Core and export the headers in a particular JSON format.
|
|
|
The other repository is the [CAPTCHA Monitor Web repository](https://gitlab.torproject.org/woswos/CAPTCHA-Monitor-Web). It contains the code for the websites served by Cloudflare (see the [domains used for testing](home#domains-used-for-testing) section) and the Nginx configuration of the webserver. These websites are used during the measurements to test specific properties of the Cloudflare blocking algorithm.
|
|
|
|
|
|
|
|
|
### CAPTCHA Monitor API
|
... | ... | @@ -74,10 +72,13 @@ The API is responsible for serving the collected data over a RESTful API. It bot |
|
|
### CAPTCHA Monitor Dashboard
|
|
|
The dashboard is used for visualizing the analyzed data and for detecting anomalies in the trends. The code, issues, and documentation related to the CAPTCHA Monitor Dashboard can be found in [this repository](https://gitlab.torproject.org/woswos/CAPTCHA-Monitor-Dashboard). A live instance of the dashboard can be accessed at [dashboard.captcha.wtf](https://dashboard.captcha.wtf) or [captchaufjq5m2i73up537pldaxnbp6rzcbdrzc7y5rlwtx3mwigznad.onion](http://captchaufjq5m2i73up537pldaxnbp6rzcbdrzc7y5rlwtx3mwigznad.onion/).
|
|
|
|
|
|
## Challenges
|
|
|
|
|
|
|
|
|
## Findings/Learnings
|
|
|
So far, I have observed that using the Tor Browser Bundle out of the box, without changing its configuration, doesn't lead to a high CAPTCHA rate on Cloudflare-fronted websites (assuming the website owners don't explicitly block exit relays). That said, modifying the user-agent, or making any other change that makes your browser's fingerprint deviate from that of a typical Tor Browser user, significantly increases the chance of getting CAPTCHAs. For example, using regular Firefox over Tor resulted in CAPTCHAs in ~90% of the measurements. I believe Cloudflare is very aggressive against "Firefox over Tor" users because many people, unfortunately, use Chromium/Firefox + Selenium + Tor to scrape web pages and bypass IP-based rate limits.
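For readers who want to reproduce this kind of comparison from raw measurements, a per-client CAPTCHA rate can be computed along these lines. This is a rough sketch: the record layout (`client` and `captcha` keys) and the function name are assumptions, and the sample records are made up purely to exercise the code. They are not project data.

```python
from collections import defaultdict


def captcha_rate_by_client(measurements):
    """Return the fraction of measurements that hit a CAPTCHA, per client."""
    totals = defaultdict(int)
    captchas = defaultdict(int)
    for m in measurements:
        totals[m["client"]] += 1
        captchas[m["client"]] += int(m["captcha"])
    return {client: captchas[client] / totals[client] for client in totals}


# Made-up records for illustration only; not results from the project.
sample = [
    {"client": "tor_browser", "captcha": False},
    {"client": "tor_browser", "captcha": False},
    {"client": "firefox_over_tor", "captcha": True},
    {"client": "firefox_over_tor", "captcha": True},
    {"client": "firefox_over_tor", "captcha": False},
]
rates = captcha_rate_by_client(sample)
```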
|
|
|
|
|
|
Additionally, I observed that the TLS fingerprint plays a significant role in whether someone gets a CAPTCHA. As a part of the project, I decided to capture the HTTP headers during measurements to understand how they affect the CAPTCHA rates. Initially, I used a Python library called [seleniumwire](https://github.com/wkeeling/selenium-wire/) to capture the HTTP headers by intercepting the traffic between the Tor Browser and Tor. With this setup, I got CAPTCHAs about 98% of the time. seleniumwire forwards the traffic transparently, but it has a different TLS fingerprint than Tor Browser. I figured out that the difference in TLS fingerprints was triggering the MITM detection on the Cloudflare side, which resulted in the very high CAPTCHA rates.
|
|
|
|
|
|
Interestingly, when I tried the exact same Tor Browser & seleniumwire setup without Tor, I got practically no CAPTCHAs. I believe the MITM detection is more aggressive if the traffic is coming through an exit relay. So, I stopped using seleniumwire to capture headers because it didn't reflect what a real human Tor Browser user usually experiences, and started using the [HTTP Header Live web browser extension by Martin Antrag](https://github.com/Nitrama/HTTP-Header-Live).
|
|
|
|
|
|
|
|
|
## Communications
|
... | ... | @@ -115,9 +116,4 @@ Yes, please! You can take a look at the [contributing section](home#contributing |
|
|
|
|
|
|
|
|
## How can I contact you?
|
|
|
Thank you for your interest! Please take a look at the [contact section](home#contact).
|
|
|
|
|
|
|
|
|
## Conclusion
|
|
|
|
|
|
<!-- https://developers.google.com/open-source/gsoc/help/work-product -->