# Google Summer of Code 2021:

##### Final Report, Apratim Ranjan Chakrabarty

## Project: **[Alexa Top Sites Captcha and Tor Block Monitoring (Captcha Monitor)](https://community.torproject.org/gsoc/alexa-captcha-monitoring/)**
This page contains the detailed final report for the project [Alexa Top Sites Captcha and Tor Block Monitoring (Captcha Monitor)](https://summerofcode.withgoogle.com/projects/#4883883176230912). For the latest development and updates, refer to the [GSoC 2021 wiki page](https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/wikis/GSoC-2021). If you have questions about the project, see the [FAQs page](https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/wikis/GSoC-2021/Faqs).
### Introduction:

The project tracks top websites from the Alexa/Moz500 rankings and aims to build detailed knowledge of which websites partially block, fully block, return Captchas to, or limit functionality for Tor users, clients, and relays. The results are then collected, analyzed, and used to answer different metrics-related questions, forming a basic understanding of how websites block Tor.
### Work Done:

The project consists of two major parts:

+ **The Analysis Part**
+ **The Dashboard Part**

----
#### The Analysis Part:

As the name suggests, this is the analysis module, the brain of the project. It categorizes each website according to whether it blocks Tor completely, blocks it partially, returns Captchas, or doesn't discriminate against Tor. It then writes the results into the `AnalyzeCompleted` table, which is later queried for insights and for visualization.

The code consists of three main checks:

+ **Status Check:** Checks the status code of the HTTP response, and can thereby detect whether a website blocks Tor completely.

+ **Dom Analysis:** Compares the structure of the pages fetched through Tor exit nodes with those fetched through control nodes, and approximates the result from the differences. This helps find pages that block Tor partially.

+ **Captcha Checker:** Checks whether the website returns a Captcha to Tor exit nodes or not, or even to both Tor exit nodes and control nodes.

Further, there is another module, the `Consensus Lite Module`. It adds an extra check: it fetches through proxies and compares the results with those from Tor exit nodes and control nodes. This increases confidence in the result, since more vantage points mean more collected data, enabling the system to work better.

The analyzer code can be seen [here](https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/blob/master/src/captchamonitor/core/analyzer.py). The flowchart describing it can be found [here](https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/wikis/GSoC-2021#updated-logic).
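To make the three checks described above more concrete, here is a minimal sketch in Python. It is not the code from `analyzer.py`: the function names, the category labels, the keyword-based Captcha heuristic, and the `0.8` similarity threshold are all assumptions chosen for illustration, and the DOM comparison is approximated by comparing the sequences of opening tags with the standard library's `difflib`.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the names of all opening tags, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    parser = TagCollector()
    parser.feed(html)
    return parser.tags


def dom_similarity(tor_html, control_html):
    """Similarity in [0, 1] between the tag sequences of two pages."""
    matcher = SequenceMatcher(None, tag_sequence(tor_html), tag_sequence(control_html))
    return matcher.ratio()


def has_captcha(html):
    # Hypothetical heuristic: a plain keyword search in the page source.
    return "captcha" in html.lower()


def classify(tor_status, tor_html, control_status, control_html, threshold=0.8):
    """Return an illustrative category label for one website."""
    # Status check: the control fetch succeeds but the Tor fetch fails.
    if control_status < 400 and tor_status >= 400:
        return "tor_blocked"
    # Captcha check: only for Tor, or for both vantage points.
    tor_captcha = has_captcha(tor_html)
    control_captcha = has_captcha(control_html)
    if tor_captcha and control_captcha:
        return "captcha_for_both"
    if tor_captcha:
        return "captcha_for_tor"
    # DOM analysis: a large structural difference suggests partial blocking.
    if dom_similarity(tor_html, control_html) < threshold:
        return "tor_partially_blocked"
    return "no_discrimination"
```

For example, `classify(403, "", 200, "<html><body><p>Hello</p></body></html>")` falls into the status check and yields the `tor_blocked` label. A real analyzer would need a sturdier Captcha detector and a tuned threshold, since pages with dynamic content differ between fetches even without any blocking.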
#### The Dashboard Part:

This is the part displayed on the frontend. It queries multiple tables such as `FetchCompleted`, `AnalyzeCompleted`, and `Relay` to gather information about the analyzed data, then renders it for the public. Currently, the dashboard shows a **graph for each individual relay ID**, with the metrics in table format below it. I've also opened an MR that would show the **data for each website ID**.

_Dashboard:_

![image](uploads/10f52cc7c680272cf747e1adc40c5d37/image.png)

_Individual Relay:_

![image](https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/uploads/38c09e80f33b882260ad5a8eaadbf88a/image.png)
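The per-relay view could be backed by an aggregation along the following lines. This is only a sketch under stated assumptions: it uses an in-memory SQLite database from the Python standard library rather than the project's actual database layer, and the snake_case table names (`relay`, `analyze_completed`) and all column names are hypothetical stand-ins for the `Relay` and `AnalyzeCompleted` tables, whose real schema is not shown on this page.

```python
import sqlite3


def captcha_rate_per_relay(conn):
    """Return (fingerprint, total, captchas, rate) tuples, one per relay."""
    query = """
        SELECT r.fingerprint, COUNT(*), SUM(a.captcha_checker)
        FROM analyze_completed AS a
        JOIN relay AS r ON r.id = a.relay_id
        GROUP BY r.fingerprint
        ORDER BY r.fingerprint
    """
    return [
        (fingerprint, total, captchas, captchas / total)
        for fingerprint, total, captchas in conn.execute(query)
    ]


# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE relay (id INTEGER PRIMARY KEY, fingerprint TEXT);
    CREATE TABLE analyze_completed (
        id INTEGER PRIMARY KEY,
        relay_id INTEGER REFERENCES relay(id),
        captcha_checker INTEGER  -- 1 if a Captcha was returned, else 0
    );
    INSERT INTO relay VALUES (1, 'AAAA'), (2, 'BBBB');
    INSERT INTO analyze_completed (relay_id, captcha_checker)
        VALUES (1, 1), (1, 0), (1, 1), (2, 0);
""")

rows = captcha_rate_per_relay(conn)
```

Each resulting tuple can then be handed to the template that draws the graph and the metrics table for that relay.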
##### In short:

Since the previous work done during [GSoC 2020](https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/wikis/GSoC-2020-Home) aimed at tracking the different CDNs, I reused the `Fetcher Working Module` and the relay list from the consensus, and added `The Analysis Module`, `Consensus Module`, and `Domain List`.

Further, I used the `Jinja` templating engine to pass the data from the backend to the frontend part of the code. For even more information about the work done, refer to the [Roadmap](https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/wikis/GSoC-2021#nearby-goals-and-roadmap), and for the design, refer to the [Architecture](https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/wikis/home#architecture).
#### Links:

+ [**Commits**](https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/commits/master?author=hackhard)
+ **Major MRs**:
  - https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/commit/481efc849082ad0848e19c732223eecf8fb7b229
  - https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/commit/d2eaa179168af8256cc55bec42dccc8ce55e0aed
  - https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/commit/2cd368ac2b4730569faa2eb593bbc1d9d0211cfa
#### Findings:

Sometimes a website like https://dan.me.uk/ isn't able to detect a Tor connection and hence allows it. My guesses are: the particular Tor exit node hasn't yet been added to the website's blocklist; or the website dynamically blocks only those nodes that transmit attacks such as DDoS, or from which too many requests are coming; or the exit relay makes its outgoing connections from a different IP than the one in the [blocklist](https://www.dan.me.uk/torlist/?exit).

I also learnt that a few websites, like [Wikipedia](https://en.wikipedia.org), restrict certain features for Tor users (Wikipedia doesn't allow posting). A list can be found [here](https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/issues/65#note_2743646). I couldn't come up with an automated solution for these websites, so for now I'm checking them manually.
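The last guess about https://dan.me.uk/ above (an exit relay whose outgoing IP differs from the one on the blocklist) can be illustrated with a small Python sketch. The list format (one IP address per line), the `relay` dictionary with its `or_address` and `exit_address` keys, and the sample addresses (taken from the reserved documentation ranges) are all assumptions made for this example, not data from the project.

```python
import ipaddress


def parse_blocklist(text):
    """Parse a blocklist with one IP address per line, skipping blank lines."""
    return {
        ipaddress.ip_address(line.strip())
        for line in text.splitlines()
        if line.strip()
    }


def is_listed(address, blocklist):
    """True if the given address appears in the blocklist."""
    return ipaddress.ip_address(address) in blocklist


# Hypothetical blocklist and relay: the relay is listed under the address it
# advertises, but its outgoing (exit) traffic leaves from another address.
blocklist = parse_blocklist("192.0.2.10\n198.51.100.7\n")
relay = {"or_address": "192.0.2.10", "exit_address": "203.0.113.5"}

listed_by_or_address = is_listed(relay["or_address"], blocklist)
listed_by_exit_address = is_listed(relay["exit_address"], blocklist)
```

Here the relay is on the list by its advertised address, yet a website that compares the incoming connection's source IP against the list sees only the exit address and finds no match, so the request is allowed.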
#### Future Works:

There are still some deliverables that I couldn't complete and will continue working on, along with the remaining issues related to Docker. The tasks I will focus on in the future are:

- [ ] Probability for Getting Tor Discrimination #101
- [ ] Dockers utilizing too much space when kept running #102
- [ ] Proxy vs Relay countries #98 (use HTML similarity or similar modules that could be more powerful than dom_analyze)
- [ ] Errors, Bugs and Edge Cases #94
- [ ] Add a utility for cleaning up dead tor containers #95
- [ ] Better UI/UX for the dashboard.
#### Making it possible:

None of this would have been possible without the great support and untiring efforts of my mentors @woswos and @Geko, the people at #tor-dev and on the Tor mailing list, and especially @woswos, who reviewed my code with much patience and answered all my queries. This would also not have been possible without Google Summer of Code.

Thank you all for this amazing learning experience!