Update GSoC 2021, authored by hackhard
Hello, I'll be updating the wiki for GSoC'21
As of now, the **Captcha Monitoring** project tracks how often webpages fronted by a CDN (for example Cloudflare, Akamai, or Amazon CloudFront) return CAPTCHAs to Tor clients. The details can be found [here](https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/wikis/GSoC-2020-Home).
This is the wiki for the GSoC 2021 branch, which will track the Alexa Top 500 websites to give a detailed picture of which websites block Tor clients or return CAPTCHAs to them. The project aims to do so by fetching webpages over a period of time from both Tor clients/browsers and non-Tor browsers, and comparing the results. The collected results will then be used to answer the different metrics and to build an understanding of how websites block Tor and how that affects Internet freedom.
## Details :
#### The Overall Flowchart:
![image](https://user-images.githubusercontent.com/34208125/112788271-266bb400-9078-11eb-9a72-6932a6e7291d.png)
The main focus is to track websites, including the [**Alexa Top 500 sites**](https://www.alexa.com/topsites), that block Tor exit nodes fully or partially. A basic approach would be to compare the HTTP status codes, headers (browser-dependent or not), and DOM tree of a given website when fetched through a Tor exit node and when fetched without Tor, but this might not always give correct results. So I’ll try to categorize the results into cases and cover them all. Below, I expand on each path:
The non-Tor path serves as the reference path. We compare the Tor path against it to find any differences between the two, more specifically between Tor exit nodes and browsers over Tor on one side and non-Tor browsers on the other. Since we are going to compare the information, we save everything that might help with the comparison. We first fetch the HTTP headers and HTTP status codes, that is, superficial information about the website that might help to differentiate between the Tor Browser Bundle (Tbb) and a normal browser (Nb) without scanning the whole website. Results may come easily in cases where `status_code(Tbb) != status_code(Nb)`:
Website over Tor | Website not over Tor
--------|--------
![image](https://user-images.githubusercontent.com/34208125/112788793-451e7a80-9079-11eb-8ff3-812e4a942870.png) | ![image](https://user-images.githubusercontent.com/34208125/112788866-75661900-9079-11eb-928e-2083aac75f91.png)
{Fig 1.1}
Or even in cases like these:
Website over Tor | Website not over Tor
--------|--------
![image](https://user-images.githubusercontent.com/34208125/112789352-882d1d80-907a-11eb-92ee-a012a2bc8bc6.png) | ![image](https://user-images.githubusercontent.com/34208125/112789391-9b3fed80-907a-11eb-90a6-88012df9c589.png)
{Fig 1.2}
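The superficial comparison described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `FetchResult`, `compare_superficial`, and the list of ignored headers are hypothetical names and choices, not part of the Captcha Monitor code base.

```python
from dataclasses import dataclass, field


@dataclass
class FetchResult:
    """Hypothetical container for what a single fetch returned."""
    status_code: int
    headers: dict = field(default_factory=dict)


def compare_superficial(tbb, nb,
                        ignored_headers=("date", "set-cookie", "expires", "age")):
    """Return a list of readable differences between a Tbb and an Nb fetch.

    The ignored headers are an assumption: they tend to change on every
    request, so they say little about blocking.
    """
    differences = []
    if tbb.status_code != nb.status_code:
        differences.append(f"status_code: {tbb.status_code} != {nb.status_code}")

    # HTTP header names are case-insensitive, so normalise them first.
    tbb_headers = {k.lower(): v for k, v in tbb.headers.items()}
    nb_headers = {k.lower(): v for k, v in nb.headers.items()}

    for name in sorted(set(tbb_headers) | set(nb_headers)):
        if name in ignored_headers:
            continue
        if tbb_headers.get(name) != nb_headers.get(name):
            differences.append(
                f"header {name}: {tbb_headers.get(name)!r} != {nb_headers.get(name)!r}"
            )
    return differences
```

An empty list means the superficial information alone cannot tell the two fetches apart, which is the case handled next.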
If for some reason we cannot differentiate using the superficial information, say because the Tbb and Nb return the same status code (`status_code(Tbb) == status_code(Nb)`), we move on to other options:
* Compare the _length of the generated DOM_ from both Tbb and Nb, which might work for a case like the one below:
Website over Tor | Website not over Tor
--------|--------
![image](https://user-images.githubusercontent.com/34208125/112789546-f4a81c80-907a-11eb-8961-4563fb040236.png) | ![image](https://user-images.githubusercontent.com/34208125/112789554-f83ba380-907a-11eb-9192-f65e649e344d.png)
{Fig 1.3}
Here both status codes are `200`, yet there is a clear difference between the results, which is why I’m planning to compare the lengths.
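A minimal sketch of the length comparison, assuming the generated DOM is available as a string. The function names and the `0.5` threshold are illustrative assumptions that would need tuning against real pages.

```python
def dom_length_ratio(tbb_html, nb_html):
    """Return shorter length / longer length, a value between 0.0 and 1.0."""
    longest = max(len(tbb_html), len(nb_html))
    if longest == 0:
        # Both documents are empty, so they are trivially the same length.
        return 1.0
    return min(len(tbb_html), len(nb_html)) / longest


def lengths_differ(tbb_html, nb_html, threshold=0.5):
    """Flag the pair when one DOM is much shorter than the other.

    The threshold is an assumed value, not one taken from the project.
    """
    return dom_length_ratio(tbb_html, nb_html) < threshold
```

A ratio is used instead of an absolute difference so that the same threshold can apply to both small and large pages.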
* For cases where the lengths are almost the same, we again cannot be sure whether the results differ. We may approach this by preparing a `consensus` for each website. _We parse the DOM elements into a tree-like structure with hashes, called `Senser`, and collect that structure from all the different Nb we have (Chromium, Firefox, cURL, Requests), so that we get a picture of the unblocked website we are looking for, and then judge the results accordingly:_
* If `Tbb ≅ Nb`, we know that Tor isn’t blocked.
* If `Tbb != Nb`, we know that the specific Tor fingerprint is blocked.
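The consensus idea above could look roughly like the sketch below. The actual `Senser` structure is not specified here, so this is an assumed stand-in: each element is represented by the hash of its tag path, the consensus is the set of hashes shared by all Nb fetches, and the similarity score is the share of the consensus that also appears in the Tbb fetch. All names are hypothetical.

```python
import hashlib
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class StructureCollector(HTMLParser):
    """Collect a hash for every tag path (e.g. html/body/div) in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        path = "/".join(self.stack + [tag])
        self.paths.add(hashlib.sha256(path.encode()).hexdigest())
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def structure_hashes(html):
    collector = StructureCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def consensus(nb_pages):
    """Hashes present in every Nb fetch (Chromium, Firefox, cURL, ...)."""
    hash_sets = [structure_hashes(page) for page in nb_pages]
    return set.intersection(*hash_sets) if hash_sets else set()


def similarity(tbb_html, nb_consensus):
    """Share of the consensus structure that the Tbb fetch also contains."""
    if not nb_consensus:
        return 1.0
    return len(structure_hashes(tbb_html) & nb_consensus) / len(nb_consensus)
```

A score close to `1.0` would correspond to `Tbb ≅ Nb`, while a low score suggests the Tor fetch received a different page. Where to draw that line is left open here and would have to be decided empirically.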
##### HIGH LEVEL FLOW-CHART: #####
![image](https://user-images.githubusercontent.com/34208125/112789927-d4c52880-907b-11eb-96da-706d7cd25ab9.png)
More details can be found [here](https://hackhard.github.io/my-blog//My-Approach-29-03).
## Nearby Goals :
- [ ] Working of the Captcha Monitor if the website blocks Tor fully _(Response errors or difference in status codes)._
- [ ] Working of the Captcha Monitor if the website returns a Captcha.
- [ ] Working of the Captcha Monitor if the website redirects to another error page *(without blocking it).*
- [ ] Working of the Captcha Monitor if the website isn't blocked, and works fine.
- [ ] Add websites with limited functionalities.
## Posts :
Meanwhile, I'll be updating a blog with further details on the project, from the difficulties faced to the different approaches taken, to help with understanding, documentation, and easy contribution.
Blog: https://hackhard.github.io/my-blog/
## Contact :
If you have any queries or feedback regarding the project, you can reach me on the Tor channels (#tor-dev or #tor-project on [OFTC](https://webchat.oftc.net/?channels=tor) IRC). My IRC handle is **\_ranchak\_**.
You can also reach me at: <abishekhmjee(at)gmail(dot)com>