|
|
|
|
|
[[_TOC_]]
|
|
|
|
|
|
|
|
|
|
# July 2020
|
|
|
|
## July 3
|
|
|
|
This week started with an unexpected issue. The CAPTCHA rates I was getting were very high compared to what Tor Browser users experience in real life. After investigating, I realized that the seleniumwire library I used to capture HTTP headers was causing this. Interestingly, this was the case only with Tor; I wasn't getting high CAPTCHA rates when I used seleniumwire over a regular internet connection. Clearly, using seleniumwire and Tor together triggers something on the Cloudflare side. I think they might be detecting the increased latency or the changed TLS fingerprint.
|
|
|
|
|
|
|
|
Anyway, I opted not to use seleniumwire since it was affecting the results negatively. I started using the [HTTP-Header-Live](https://github.com/Nitrama/HTTP-Header-Live) addon to capture the headers instead. The addon starts automatically with the browser and captures the headers inside the browser without touching the traffic itself. When the page is completely loaded, the addon writes the headers to a text file in JSON format. Later, my code reads this file and saves the results. It is not the most elegant way to solve this problem, but I had to fall back to it since the elegant method (seleniumwire) caused problems.
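Reading the addon's output boils down to parsing that JSON file. Here is a minimal sketch; the file path and JSON shape are assumptions for illustration, and the real addon output may differ:

```python
import json
from pathlib import Path

def read_captured_headers(path):
    """Read the headers the addon dumped for the last page load.

    Assumes the addon wrote a JSON object with a "headers" list of
    {"name": ..., "value": ...} entries (illustrative shape only).
    """
    raw = json.loads(Path(path).read_text())
    # Flatten into a simple header-name -> value mapping
    return {h["name"]: h["value"] for h in raw.get("headers", [])}
```

The resulting mapping can then be stored alongside the rest of the measurement.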
|
|
|
|
|
|
|
|
Here is a sample of the code I used to connect Tor Browser to the Tor network via seleniumwire. Feel free to do further testing if this issue sounds interesting to you.
|
|
|
|
https://gist.github.com/woswos/38b921f0b82de009c12c6494db3f50c5
|
|
|
|
|
|
|
|
After solving this unexpected problem, I worked on adding support for older versions of the browsers. Now, the `-b` or `--browser_version` flag can be used to provide the exact browser version. The code doesn't automatically download that version of the browser yet, but that could be a nice future addition.
|
|
|
|
|
|
|
|
I also realized that Cloudflare injects code that wasn't part of the original page. For example, here is the original code:
|
|
|
|
```
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Hello world!</title>
</head>
<body>
Hello world!
</body>
</html>
```
|
|
|
|
|
|
|
|
Here is the version Cloudflare serves:
|
|
|
|
```
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Hello world!</title>
</head>
<body>
Hello world!
<script defer="" src="https://static.cloudflareinsights.com/beacon.min.js" data-cf-beacon='{"rayId":"5a974a483cf0b6cc","version":"2020.5.1","si":10}'></script>
</body>
</html>
```
|
|
|
|
So, I decided to detect these kinds of changes as well by hashing the page. Now, the system automatically takes the MD5 hash of the page contents and compares it with the hash of the original. If there is a change, it saves that change too.
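The comparison itself is tiny. Here is a sketch of the idea, with toy pages standing in for the fetched HTML:

```python
import hashlib

def page_hash(html):
    """MD5 digest of the page contents, as a hex string."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()

# Toy pages standing in for the original and the Cloudflare-served HTML
original_page = "<html><body>Hello world!</body></html>"
cloudflare_page = ('<html><body>Hello world!'
                   '<script src="beacon.min.js"></script></body></html>')

expected = page_hash(original_page)
fetched = page_hash(cloudflare_page)
page_changed = fetched != expected  # the injected script changes the hash
```

MD5 is fine here because the hash only flags modifications; it is not used for anything security-critical.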
|
|
|
|
|
|
|
|
Additionally, I created a new section called 'Measurement Search' for showing the individual measurements behind the graphs. It also lets users perform custom queries on the data using the search box:
|
|
|
|
![cb65faad694aecd7df1a020f6a3986e7fe65f85e_2_1380x876](uploads/711c767feae62e5c7452163d1527eadb/cb65faad694aecd7df1a020f6a3986e7fe65f85e_2_1380x876.png)
|
|
|
|
![82ef3b3f90a738a93042d30a5041a4337479745c_2_1380x876](uploads/b7c5c9d8aed4e9eb8a7d2372e1316d35/82ef3b3f90a738a93042d30a5041a4337479745c_2_1380x876.jpeg)
|
|
|
|
|
|
|
|
|
|
|
|
# June 2020
|
|
|
|
## June 26
|
|
|
|
This week I spent my time parallelizing CAPTCHA Monitor using processes on the host machine. Previously, I was using Docker swarm to replicate instances of the code, but it turned out to be slow and memory hungry. Instead, I used Python's *multiprocessing* library to replicate the workers. This required a few architectural changes: I separated the code that manages Tor and Tor Browser from the main program loop. Now, the main loop creates instances of that code in separate processes and makes sure they keep running. Using the updated code, I started collecting data once more. Every day I collect data for a different metric.
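The supervisor pattern described above can be sketched like this. This is a simplified stand-in: in the real tool each worker would drive its own Tor + Tor Browser pair, while here it just marks jobs as done so the sketch stays self-contained:

```python
import multiprocessing as mp

def worker(job_queue, result_queue):
    # Stand-in worker loop; a real one would manage Tor and Tor Browser
    while True:
        job = job_queue.get()
        if job is None:          # poison pill: time to shut down
            break
        result_queue.put((job, "done"))

def run_workers(jobs, n_workers=2):
    job_q, result_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(job_q, result_q))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    for job in jobs:
        job_q.put(job)
    for _ in procs:              # one poison pill per worker
        job_q.put(None)
    results = [result_q.get() for _ in jobs]
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    print(run_workers(["relay-a", "relay-b", "relay-c"]))
```

A real main loop would also watch for dead workers and respawn them, which is the "makes sure they keep running" part.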
|
|
|
|
|
|
|
|
The next step was to display the collected data in a dashboard. You might remember that I already mentioned a dashboard and put up a screenshot of it. Actually, that was the second dashboard solution I tried. In the very beginning, I tried using [Grafana](https://grafana.com/). It is a really neat open-source dashboard solution, and it has well-designed layout options. These are all great features, but Grafana is geared towards time series data like the temperature of a CPU or the RAM usage of a computer. So, the data sources and the backend are designed for that kind of data. It also doesn't provide much flexibility for data manipulation: Grafana wants to display what the database query returns directly on the dashboard. Unfortunately, I needed more flexibility in the way I process data, and I sometimes needed to combine multiple queries. Still, I used Grafana for a while to see if I was wrong, and it turned out I wasn't.
|
|
|
|
|
|
|
|
I did further research, and I found Metabase, which is another open-source dashboard solution. As opposed to Grafana, Metabase had all the flexibility I needed in the backend to process data before showing it on the dashboard. I really liked using Metabase, but it had a lot of flaws on the frontend. For example, some of the graphs were clipped for no reason, and there was no option for fixing that. It was also consuming a lot of memory on my VPS, and I figured I could use that memory for data collection rather than spend it on the dashboard for no solid reason.
|
|
|
|
|
|
|
|
So, I ended up building my own dashboard using Node.js, Bootstrap, Chart.js, and Express.js:
|
|
|
|
|
|
|
|
![64f18f46f492cb7ee5d4f8fd6ce3e03e58cd7401_2_1380x876](uploads/76a14b5e5f931b40898fdf7b2f775cc6/64f18f46f492cb7ee5d4f8fd6ce3e03e58cd7401_2_1380x876.png)
|
|
|
|
|
|
|
|
I used what I learned during my weeks of dashboard searching to create something simple and elegant. I used Node.js & Express.js on the backend to create an API, and Bootstrap & Chart.js on the frontend for displaying data. The cool thing is that I can process the data however I want on the backend and send it to the dashboard through the API. If I don't like anything about the frontend, I can just change it! Sure, I could make changes in the other open-source dashboard solutions as well, but I would have to go through an unnecessary number of steps to achieve that. Also, now I can use the same backend API for other purposes. I was already planning to have an API for third parties to fetch data from the system, and there I have it!
|
|
|
|
|
|
|
|
Finally, I spent some time moving my project to Tor Project's new GitLab server. Previously, the code, issue tracker, and wiki page were all in different locations. Now, they are unified in the same place. GitLab also has a lot of extra productivity tools, and I can't wait to use them. Here is the new home for my code: https://gitlab.torproject.org/woswos/CAPTCHA-Monitor
|
|
|
|
|
|
|
|
|
|
|
|
## June 19
|
|
|
|
For the first time, I encountered problems with the speed of my code, and I'm glad that it happened, because it pushed me to learn how to make it run faster. I need to perform daily measurements on the Tor exit relays as a part of my project, and there are many of them. I repeat the exact same measurement over and over again and compare the results. At this scale, every extra second in an individual measurement adds roughly 25 minutes to the overall execution time.
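As a quick back-of-the-envelope check, assuming roughly 1,500 exit relays (an assumption; the live count fluctuates around that figure):

```python
# Rough number of Tor exit relays (an assumption; the live count varies)
exit_relays = 1500

# One extra second per measurement, accumulated over a full daily pass
extra_minutes = exit_relays * 1 / 60
# 1500 seconds / 60 = 25.0 extra minutes per pass
```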
|
|
|
|
|
|
|
|
When I first started, the total execution time was well over 100 hours, and we only have 24 hours in a day. This week I worked on implementing a worker pool to run many measurements in parallel. The worker pool helped me reduce the total execution time significantly (down to 40 hours), but that is still not enough. Later, I started looking at similar projects like [exitmap](https://github.com/NullHypothesis/exitmap) to see how they handle measurements. This was helpful as well, and I applied what I learned from these projects, but I still need to reduce the total execution time a lot.
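A minimal version of such a worker pool, using Python's `multiprocessing.Pool` with a stand-in measurement function (the real one launches a browser and fetches a page through the given exit):

```python
from multiprocessing import Pool

def measure(exit_fingerprint):
    # Stand-in for one full measurement; in the real tool this builds
    # a circuit through the given exit relay and fetches a test page
    return exit_fingerprint, "measured"

def measure_all(fingerprints, n_workers=8):
    # n_workers processes share the exit list, so wall-clock time
    # shrinks roughly by a factor of n_workers
    with Pool(processes=n_workers) as pool:
        return pool.map(measure, fingerprints)

if __name__ == "__main__":
    exits = ["FP%04d" % i for i in range(20)]
    print(len(measure_all(exits)))
```

`Pool.map` preserves the input order, which keeps matching results back to relays trivial.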
|
|
|
|
|
|
|
|
The biggest bottleneck is the web browser itself. Currently, every individual measurement takes ~10 seconds, and ~7 seconds of that is the web browser starting up. I hope to cut the individual measurement time down to ~5 seconds. If that is not possible, I will try to find ways to run more workers in parallel more efficiently.
|
|
|
|
|
|
|
|
|
|
|
|
## June 5
|
|
|
|
This week I spent my time getting the first version of the system up and running. I deployed a continuously running instance of my code to my server, and I connected the database to the [dashboard](https://dashboard.captcha.wtf/). I also worked on adding a few meaningful graphs to the dashboard.
|
|
|
|
|
|
|
|
![4c282592baa3db2793baec7fb3979c35ca053125_2_1380x876](uploads/182972c762e4cf2ccc67e07f1378fa17/4c282592baa3db2793baec7fb3979c35ca053125_2_1380x876.png)
|
|
|
|
|
|
|
|
I communicated with my mentors to make sure that I am on track and to get feedback on the dashboard. Based on the feedback I received, I will update the dashboard and the way I collect data with my code.
|
|
|
|
|
|
|
|
Meanwhile, I integrated Tor Stem into the system, and now I can specify a Tor exit node for testing purposes. I also merged the code that I have been restructuring into master, and I updated the README file to reflect the changes. Now, I'm working on integrating the Cloudflare API, and I plan to finish implementing it this weekend.
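Pinning circuits to a chosen exit comes down to a couple of torrc options. Here is a sketch of the config-building side; the fingerprint is a placeholder, and with Stem installed the resulting dict can be handed to `stem.process.launch_tor_with_config`:

```python
def single_exit_config(fingerprint, socks_port=9050):
    """torrc-style options that pin every circuit to one exit relay."""
    return {
        "SocksPort": str(socks_port),
        "ExitNodes": "$" + fingerprint,  # '$' marks a relay fingerprint
        "StrictNodes": "1",              # fail instead of picking another exit
    }

# Placeholder fingerprint, not a real relay
config = single_exit_config("AAAA0000BBBB1111CCCC2222DDDD3333EEEE4444")
# With a local tor binary available:
#   tor_process = stem.process.launch_tor_with_config(config=config)
```

Traffic is then routed through `socks5://127.0.0.1:9050`, and every measurement leaves the network at the same exit.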
|
|
|
|
|
|
|
|
As you may have realized, I love flowcharts & diagrams, and I made another one to explain the current state of my code :) Actually, the code doesn't ask for these details step by step; instead, I enter all of them at once at the beginning. That being said, I believe breaking the process down into smaller steps helps us humans understand what is going on.
|
|
|
|
|
|
|
|
![d1a019ec874505130ac545655ad1b508a0fa33e1_2_1380x330](uploads/732c72dec459acef29358042492f8d45/d1a019ec874505130ac545655ad1b508a0fa33e1_2_1380x330.png)
|
|
|
|
|
|
|
|
|
|
# May 2020
|
|
|
|
## May 29
|
|
|
|
**Community Bonding Period - Week 4**
|
|
|
|
|
|
|
|
This week I restructured the preliminary code I had previously. I did this to make it work as I explained in my project diagram below. The changes I implemented made it possible to easily download the database.
|
|
|
|
|
|
|
|
|
|
![3f63c7c30e9bc429ff103572d98357ad3031350f_2_1034x616](uploads/ca2d7a32cf51ef8bb67a55e781425279/3f63c7c30e9bc429ff103572d98357ad3031350f_2_1034x616.png)
|
|
|
|
|
|
Later, I worked on making the code more reliable because it wasn't always working in "headless" mode. There was an undocumented dependency problem in the [tor-browser-selenium](https://github.com/webfp/tor-browser-selenium) library that I was using: I needed to install the Firefox browser to use the library reliably. I don't think it is related to having Firefox itself installed, but rather to a piece of code Firefox installs alongside it. It took a long time to figure this out, and I will raise this issue in the library's GitHub repository to investigate further with the maintainers.
|
|
|
|
|
|
|
... | | ... | |