|
----
|
|
|
|
#### The Analysis Part:
|
|
|
|
|
|
|
|
As the name suggests, this is the Analysis Module, the brain of the project. It categorizes websites by whether they block Tor completely, block it partially, return CAPTCHAs, or don't discriminate against Tor at all, and parses the results into the `AnalyzeCompleted` table, which is then queried for insights and for visualization.
|
|
|
|
|
|
The code consists of three main checks:
|
|
+ **Status Check:** It checks the status code of the HTTP response and can thereby detect whether a website is blocked completely.
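As a rough illustration (not the project's actual code), the status check boils down to comparing the HTTP status code seen through a Tor exit against the one seen over a direct (control) connection; the function name and category labels below are hypothetical:

```python
def classify_status(direct_status: int, tor_status: int) -> str:
    """Classify a fetch based on the two HTTP status codes.

    Hypothetical sketch: a site that answers 2xx directly but returns a
    typical blocking code (403/429/503) over Tor is counted as blocked.
    """
    blocking_codes = (403, 429, 503)
    if 200 <= direct_status < 300 and tor_status in blocking_codes:
        return "blocked"
    if 200 <= direct_status < 300 and 200 <= tor_status < 300:
        return "not_discriminating"
    # Anything else (e.g. the direct fetch itself failed) needs other checks.
    return "unknown"
```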
|
|
|
|
|
|
![image](https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/uploads/38c09e80f33b882260ad5a8eaadbf88a/image.png)
|
|
|
|
|
|
|
|
##### Extensive Details:
|
|
|
|
|
|
|
|
The passage above describes the work done and the modules integrated. Below, I discuss why I chose these steps.
|
|
|
|
|
|
|
|
The first [commit](https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/blob/2cc57e1572e486c848fa71a9cb14975c30994f0a/src/captchamonitor/utils/website_parser.py) I made added a utility for parsing the Alexa Top Sites and Moz Top 500 lists to get the top 50 and top 500 websites respectively. Because the code is modular, it is easy to add new parsers for other domain sources, or to adapt the existing ones if the source websites change over time. This taught me the importance of modular coding.
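The modular idea can be sketched as a common parser interface that each source implements, so new domain lists plug in without touching the rest of the pipeline. The class names, method names, and regex below are illustrative, not the project's actual API:

```python
from abc import ABC, abstractmethod
import re


class WebsiteParser(ABC):
    """Common interface every domain-list source implements."""

    @abstractmethod
    def parse(self, raw: str) -> list:
        """Extract domain names from the raw page content."""


class MozTop500Parser(WebsiteParser):
    def parse(self, raw: str) -> list:
        # Assumes the list exposes bare domains; a real parser would
        # target the actual page markup of the source.
        return re.findall(r"\b([a-z0-9-]+\.[a-z]{2,})\b", raw)


# New sources only need a new entry here, nothing else changes.
parsers = {"moz500": MozTop500Parser()}
domains = parsers["moz500"].parse("1, google.com\n2, youtube.com")
```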
|
|
|
|
|
|
|
|
Next, I implemented the Analysis Module. I had planned to implement the `Senser Module` (`Consensus Module`: rendering the website from different vantage points and comparing it with the content returned through the Tor exit node, a content-based approach) on top of the low-hanging-fruit check of HTTP response codes as my primary check. After discussion with my mentor, however, I stuck to the structural method, which checks for structural differences between pages; I implemented it by comparing the number of nodes in the DOM structure to approximate each page's layout. I chose this over the content-based approach because dynamic websites such as Reddit or news sites may show different results to each user depending on many factors that are hard to replicate, so the content alone can differ even without any blocking.
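A minimal sketch of the structural idea, using only the standard library: approximate a page's structure by the sequence of its DOM tags, then call two fetches "similar" if their tag counts differ by less than some tolerance. The function names and the tolerance value are illustrative, not the project's actual implementation:

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every opening tag to approximate the DOM structure."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def dom_signature(html: str) -> list:
    collector = TagCollector()
    collector.feed(html)
    return collector.tags


def structurally_similar(html_a: str, html_b: str, tolerance: float = 0.1) -> bool:
    # Dynamic content changes text, not (much) structure, so comparing
    # node counts is robust to per-user content differences.
    a, b = dom_signature(html_a), dom_signature(html_b)
    return abs(len(a) - len(b)) <= tolerance * max(len(a), len(b), 1)
```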
|
|
|
|
|
|
|
|
Therefore, I added a further module named `Consensus Lite Module`, which still checks for structural differences but adds proxies to the workflow: it compares structures among proxies, Tor exit nodes, and control nodes, catching cases that the exit-versus-control comparison alone might miss. That said, looking at the `AnalyzeCompleted` table, the bulk of the results are already covered by the HTTP response checks and the exit-versus-control structural checks.
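A hypothetical sketch of the Consensus Lite idea: compare the structural signature (here, a single node count) seen through the Tor exit against the signatures from the control node and several proxies, and only flag the exit when it disagrees with the other vantage points. The thresholds and labels are illustrative:

```python
def consensus_lite(exit_sig: int, control_sig: int, proxy_sigs: list,
                   tolerance: int = 5) -> str:
    """Classify a fetch by how many vantage points agree with the exit."""

    def matches(a: int, b: int) -> bool:
        return abs(a - b) <= tolerance

    references = [control_sig] + proxy_sigs
    agreement = sum(matches(exit_sig, ref) for ref in references)
    # No agreement at all: the exit sees a genuinely different page.
    if agreement == 0:
        return "blocked"
    # Majority agreement: the differences are likely just dynamic content.
    if agreement >= len(references) / 2:
        return "not_discriminating"
    return "partially_blocked"
```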
|
|
|
|
|
|
|
|
Next, once we started getting information out of the data, I began building the dashboard to view the metrics and graphs. I first considered `plotly` and `dash`, but a majority of Tor users likely browse with the safest settings and JavaScript disabled, so I switched to `Matplotlib` for the charts and `Jinja` for templating.
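The JS-free approach can be sketched as: render each chart to a static PNG with Matplotlib's headless backend and embed it in an HTML page via a Jinja template, so the dashboard works with JavaScript disabled. The file names and sample numbers below are made up for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, renders without a display
import matplotlib.pyplot as plt
from jinja2 import Template

# Hypothetical per-category percentages for one relay.
categories = ["blocked", "partially blocked", "not discriminated"]
percentages = [12, 8, 80]

fig, ax = plt.subplots()
ax.bar(categories, percentages)
ax.set_ylabel("% of fetched websites")
fig.savefig("relay_summary.png")

# Static HTML: just an <img> tag, no JavaScript required.
page = Template("""<html><body>
<h1>Relay summary</h1>
<img src="relay_summary.png" alt="per-category percentages">
</body></html>""").render()

with open("dashboard.html", "w") as f:
    f.write(page)
```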
|
|
|
|
I planned on implementing the following:
|
|
|
|
+ _Graph according to relay IDs:_ For each `relay_id`, it shows how many of the websites fetched through that relay blocked it, partially blocked it, or did not discriminate against it, with the lists of those websites available as JSON data in the console.
|
|
|
|
|
|
|
|
Part of this data is rendered as a table, since I didn't have a good place to show the large number of websites; I also created an individual page for each relay with a graph of timestamp versus the percentage of each tag (x vs y).
|
|
|
|
|
|
|
|
+ _Graph according to website IDs:_ For each domain, it groups relays by the countries they belong to and further splits the data by timestamp and block type, stating whether the website is blocked, not blocked, or returns `CAPTCHAs`. This too is produced as JSON; I plan to add it to the dashboard and create per-website pages answering questions like _Is this website punishing Tor exits according to their countries?_
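The per-relay JSON behind these graphs can be sketched as a simple grouping of analysis rows, as they might come out of the `AnalyzeCompleted` table, by relay and verdict. The field names and sample rows are illustrative, not the project's actual schema:

```python
import json
from collections import defaultdict

# Hypothetical (relay_id, website, verdict) rows from the analysis results.
rows = [
    ("relay1", "example.com", "blocked"),
    ("relay1", "example.org", "not_discriminating"),
    ("relay2", "example.com", "partially_blocked"),
]

# Group website lists per relay and per verdict.
per_relay = defaultdict(lambda: defaultdict(list))
for relay_id, website, verdict in rows:
    per_relay[relay_id][verdict].append(website)

# Serialize for the console / dashboard consumers.
report = json.dumps(per_relay, indent=2)
```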
|
|
|
|
|
|
|
|
|
|
##### In short:
|
|
|
|
Since the previous work done during [GSoC 2020](https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/wikis/GSoC-2020-Home) aimed at tracking different CDNs, I reused the `Fetcher Working Module` and the relay list from the consensus, and added the `Analysis Module`, `Consensus Module`, and `Domain List`.
|
|
|
|
|
|
#### Findings:
|
|
|
|
|
|
|
|
Sometimes a website like https://dan.me.uk/ fails to detect a Tor connection and allows it. My guesses: the particular Tor exit node hasn't yet been added to the website's blocklist; the website dynamically blocks only nodes that send attack traffic such as `DDoS` floods or too many requests from a single exit relay; or the exit relay makes its outgoing connection from a different IP than the one in the [blocklist](https://www.dan.me.uk/torlist/?exit).
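The last guess is easy to test once the published exit list has been fetched (e.g. from the dan.me.uk torlist URL above): check whether the relay's observed outgoing address actually appears in it. A minimal sketch, assuming the list is one IP per line:

```python
def in_blocklist(outgoing_ip: str, blocklist_text: str) -> bool:
    """Return True if the observed outgoing IP appears in the fetched list.

    A relay whose outgoing connection uses a different IP than its
    listed address would slip through such an IP-based blocklist.
    """
    blocked = {line.strip() for line in blocklist_text.splitlines() if line.strip()}
    return outgoing_ip in blocked
```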
|
|
|
|
|
|
I also learnt that a few websites, like [Wikipedia](https://en.wikipedia.org), restrict certain features for Tor users (Wikipedia doesn't allow posting). A list can be found [here](https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/issues/65#note_2743646). I couldn't come up with an automated way to detect these cases, so for now I check them manually.
|
|
|
|
|
|
|
|
#### Where I could improve:
|
|
|
|
|
|
|
|
Though I completed most of the planned work, if not all of it, what slowed me down was digging through piles of existing code while writing modules on top of it, trying to understand multiple things at once. I overcame this by writing small demo scripts that produced the necessary outputs, confirming my understanding, and then integrating them. This was comparatively faster and let me learn the codebase hands-on, rather than studying it in full first and digging deeper afterwards.
|
|
|
|
|
|
#### Future Works:
|
|
|
|
|
|
|
|
There are still some deliverables I couldn't complete that I will keep working on, along with the remaining Docker-related issues. The tasks I will focus on in the future are:
|
|
- [ ] Errors, Bugs and Edge Cases #94
|
|
|
|
- [ ] Add a utility for cleaning up dead tor containers #95
|
|
|
|
- [ ] Better UI/UX for the dashboard.
|
|
|
|
- [ ] Discuss with people in the field (e.g. Micah Sherr) what more could be added.
|
|
|
|
|
|
|
|
Finally, the contribution doesn't end here: the websites we see now may change in the future, so I hope to stay involved and integrate this with the ["community" version of Relay Search, if not the Tor Metrics Relay Search itself](https://gitlab.torproject.org/tpo/network-health/metrics/website/-/issues/40002) :)
|
|
|
|
|
|
#### Making it possible:
|
|
|
|
|
|
|
|
All of this would not have been possible without the great support and untiring efforts of my mentors Barkin Simsek (@woswos) and Georg Koppen (@gk), without whom things would have been quite difficult; the people at #tor-dev and on the Tor mailing list, especially @woswos for patiently reviewing my code and answering all my queries; Roger Dingledine (@arma) for helping me get in contact with Micah Sherr; and the people on IRC for providing valuable answers to my questions. Also, this would not have been possible without Google Summer of Code and DIAL.
|
|
|
|
|
|
Thank you all for this amazing learning experience!
|
|
|