# WARNING: This page is a working draft
# Welcome to the CAPTCHA Monitoring project's wiki!
This wiki page contains the final report for the "Tor Project: Cloudflare CAPTCHA Monitoring" project for Google Summer of Code 2020. It gives a broad overview of the work completed during the GSoC period; see the [home wiki page](home) for more detailed and up-to-date information.
## What is this project about?
The **CAPTCHA Monitoring** project aims to track how often webpages fronted by CDNs (e.g., Cloudflare, Akamai, Amazon CloudFront) return CAPTCHAs to Tor clients. It does this by fetching webpages via both Tor and mainstream web browsers and comparing the results. The tests are repeated periodically to reveal patterns over time. The collected metadata, metrics, and results are analyzed and displayed on a dashboard to show how CDN providers manipulate internet traffic and affect people's access to the internet.
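
The comparison step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `FetchResult` record, its field names, and the `captcha_rate_by_client` helper are hypothetical and are not taken from the CAPTCHA Monitor codebase, and the sample records are made up for the example rather than measured.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FetchResult:
    """One fetch of one URL by one client (hypothetical record layout)."""
    url: str
    client: str        # e.g. "tor_browser" or "firefox"
    got_captcha: bool  # True if the returned page was a CAPTCHA page


def captcha_rate_by_client(results):
    """Return {client: fraction of that client's fetches that hit a CAPTCHA}."""
    totals = defaultdict(int)
    captchas = defaultdict(int)
    for result in results:
        totals[result.client] += 1
        captchas[result.client] += int(result.got_captcha)
    return {client: captchas[client] / totals[client] for client in totals}


# Made-up sample records, only to show the shape of the comparison.
results = [
    FetchResult("https://example.com", "tor_browser", True),
    FetchResult("https://example.com", "tor_browser", False),
    FetchResult("https://example.com", "firefox", False),
    FetchResult("https://example.com", "firefox", False),
]
rates = captcha_rate_by_client(results)
print(rates)
```

Repeating the same computation over results collected at different times is what would make patterns over time visible.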
## What work has been completed during the GSoC period?
### Background
I have been personally annoyed by receiving CAPTCHAs while using Tor, and going through the Tor Project's issue tickets, especially ticket [#33010](https://gitlab.torproject.org/tpo/metrics/ideas/-/issues/33010), showed that I wasn't alone in this. After years of user complaints and research papers published on the topic, it was clear that a public database and data collection tool was needed to back up the claims and let CDN companies take action. So, the CAPTCHA Monitor was born. Since this project didn't exist before, I designed the whole system and built it during GSoC. The designs of similar systems, such as [OONI](https://ooni.org/), [Tor Metrics](https://metrics.torproject.org/), and [ExitMap](https://github.com/NullHypothesis/exitmap/), influenced the decisions I made.

Here is a high-level overview of the design I implemented:
```mermaid
%% Please enable JavaScript to see this flowchart
flowchart LR
  %% [...] (the rest of the flowchart is elided here)
  api --> public
```
There are five separate repositories dedicated to different parts of the system. I will explain the work completed for each repository separately, and here you can see the hierarchy of the repositories:
```
CAPTCHA Monitor
|-- Core
[...]
```

[...] The weekly blog posts that were posted and emails sent to the Tor mailing lists [...]
## What would you do differently if you did it all again?
Before starting to work on this project, I was using Tor Browser as is, and I didn't have detailed technical knowledge of how the whole system works. I only had a rough idea of how Tor works, and my knowledge of Tor Browser and the Tor software grew pretty much organically as I asked questions on IRC, read the spec files, and wrote code. As you may have already guessed, I made a few bad decisions at the beginning of the project because of my initially limited knowledge of Tor's inner workings.

For example, I initially decided to use relays' OR addresses to index them in the database, and I thought all relays used their OR addresses as their exit addresses. Later, I learned that OR addresses are not a good choice for indexing, and I switched to relay fingerprints. I needed to edit or remove some parts of the codebase to make this change.
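
To illustrate why the fingerprint makes a better key, here is a minimal sketch of an in-memory index keyed by relay fingerprint, with the OR address and exit addresses stored as ordinary attributes. The `Relay` and `RelayIndex` names are hypothetical and not taken from the CAPTCHA Monitor code; the fingerprint is a placeholder, and the IP addresses come from the reserved documentation ranges.

```python
from dataclasses import dataclass, field


@dataclass
class Relay:
    fingerprint: str   # 40 hex characters identifying the relay
    or_address: str    # address the relay accepts Tor connections on
    exit_addresses: set = field(default_factory=set)  # addresses seen by websites


class RelayIndex:
    """Index relays by fingerprint instead of by OR address."""

    def __init__(self):
        self._by_fingerprint = {}

    def upsert(self, fingerprint, or_address, exit_address=None):
        """Create or update a relay record and return it."""
        key = fingerprint.upper()  # normalize so lookups ignore letter case
        relay = self._by_fingerprint.setdefault(key, Relay(key, or_address))
        relay.or_address = or_address
        if exit_address is not None:
            relay.exit_addresses.add(exit_address)
        return relay

    def get(self, fingerprint):
        return self._by_fingerprint.get(fingerprint.upper())

    def __len__(self):
        return len(self._by_fingerprint)


index = RelayIndex()
fp = "A" * 40  # placeholder fingerprint, not a real relay
index.upsert(fp, "192.0.2.10:9001", exit_address="198.51.100.7")
index.upsert(fp.lower(), "192.0.2.10:9001", exit_address="198.51.100.8")
relay = index.get(fp)
print(len(index), sorted(relay.exit_addresses))
```

Because the key never depends on an address, the record stays the same relay even when its exit addresses differ from its OR address.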

Another example is my initial tool selection. I underestimated how much my project would grow and started with a modest SQLite database to store the data I collect. It did an OK job until the database passed the 1 GB threshold and I added the web API and parallel web page fetchers. My database needed to handle long simultaneous connections, which turned out to be very problematic with SQLite. I solved these issues by switching to PostgreSQL, but once again, I needed to edit the code to make this change. Luckily, I had expected to make this upgrade at some point in the future (though not during the GSoC period), so I had built the database connection class to be modular. As a result, I only needed to edit that class, and the rest of the code worked just fine.
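
The benefit of a modular database class can be sketched as below. This is an illustration of the general pattern, not the project's actual class: `ResultStore`, `SQLiteResultStore`, `record_fetch`, and the table layout are all assumptions made for the example. Only the SQLite backend is shown because it ships with Python; a PostgreSQL backend would be another subclass implementing the same two methods.

```python
import sqlite3
from abc import ABC, abstractmethod


class ResultStore(ABC):
    """Interface the rest of the code depends on (hypothetical)."""

    @abstractmethod
    def insert_result(self, url, client, got_captcha):
        """Persist one fetch result."""

    @abstractmethod
    def count_results(self):
        """Return the number of stored results."""


class SQLiteResultStore(ResultStore):
    """SQLite-backed implementation of the interface."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS results ("
            "url TEXT NOT NULL, client TEXT NOT NULL, got_captcha INTEGER NOT NULL)"
        )

    def insert_result(self, url, client, got_captcha):
        # Using the connection as a context manager commits on success
        # and rolls back if the statement raises.
        with self._conn:
            self._conn.execute(
                "INSERT INTO results VALUES (?, ?, ?)",
                (url, client, int(got_captcha)),
            )

    def count_results(self):
        return self._conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]


def record_fetch(store, url, client, got_captcha):
    """Caller code only knows the ResultStore interface, not the backend."""
    store.insert_result(url, client, got_captcha)


store = SQLiteResultStore()
record_fetch(store, "https://example.com", "tor_browser", True)
print(store.count_results())
```

With this shape, swapping the backend means changing only the line that constructs `store`; `record_fetch` and any other callers stay untouched.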

So, if I did it all again, I would read all of the spec files, learn more about how things work in detail, and plan the project's future trajectory better before starting to code. That said, I learn better when I see things in action, and I would probably end up making similar mistakes in other ways. I guess that is a part of the learning experience :)
## What is left and next?
I pretty much finished everything I planned to work on (see [roadmap](home#roadmap)). I'm still working on the second version of the dashboard (see #41). I was expecting to make minor revisions to the dashboard, but the feedback I received from the community made a fundamental change necessary, so changes of that size weren't part of the anticipated GSoC roadmap. I still wanted to finish them during the GSoC period, but once again, I underestimated their complexity. I now plan to finish the v2 dashboard in September. After that, I will ask the community for feedback and add new things based on it.

Also, I need to finish documenting the CAPTCHA Monitor Core code. I use a lot of comments while I code, so the code is already documented in that sense. However, I still need to finish writing the docstrings that explain the arguments and return values of each function.
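
As an illustration of the kind of docstring meant here, the sketch below documents the arguments and return value of a small function. The `contains_captcha` helper and its parameters are hypothetical, and the reStructuredText field style is just one common convention; the style actually used in CAPTCHA Monitor Core may differ.

```python
def contains_captcha(page_html, captcha_sign):
    """Check whether a fetched page appears to be a CAPTCHA page.

    :param page_html: HTML source of the fetched page.
    :type page_html: str
    :param captcha_sign: Text whose presence marks the page as a CAPTCHA page.
    :type captcha_sign: str
    :returns: True if ``captcha_sign`` occurs in ``page_html``
        (compared case-insensitively), False otherwise.
    :rtype: bool
    """
    return captcha_sign.lower() in page_html.lower()


# The docstring is available at runtime, e.g. through help().
help(contains_captcha)
```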

Finally, I will work on [Tor Metrics](https://metrics.torproject.org/) (see [#tpo/metrics/website/40002](https://gitlab.torproject.org/tpo/metrics/website/-/issues/40002)). I'm committed to working on this project, and I'm not planning to stop until we achieve everything on the [expected long-term impact](home#expected-long-term-impact) agenda. New items will probably be added to the agenda as well.
## Acknowledgments