_Following blog posts are mirrored from [DIAL's blog](https://hub.osc.dial.commu
|
|
|
|
|
# July 2020
|
|
|
## July 3
|
|
|
This week I worked on solving the memory leak problem, and I found the root cause and stopped the leak. I was using a timeout function while fetching the pages. It turned out that the timeout value I used was shorter than it should have been, and the timeout function wasn't sending the right signals to properly kill the browser instances. So, I increased the timeout and added the right calls to shut down the browser instances properly. This solved the memory leak issue.
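The shape of the fix looks roughly like this. This is a minimal hypothetical sketch, not the project's actual code: the function and constant names are mine, and the driver is assumed to be a Selenium-style WebDriver object.

```python
# Hypothetical sketch of the fix: a more generous page-load timeout,
# plus a try/finally that always quits the browser, so no orphaned
# browser processes pile up and leak memory.
PAGE_LOAD_TIMEOUT = 120  # seconds; the old value was too short

def fetch_page(driver, url):
    driver.set_page_load_timeout(PAGE_LOAD_TIMEOUT)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # quit() shuts down the whole browser instance; close() alone
        # can leave processes behind, which is what leaks the memory
        driver.quit()
```

The key detail is the `finally` block: even when the fetch times out and raises, the browser instance is still torn down.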
|
|
|
|
|
|
Next, I worked on the algorithm that decides which test to run for exit relays. The algorithm compiles a list of measurements and checks whether a given relay has completed all of them. If the measurements are not complete, the algorithm assigns one of the uncompleted measurements to the exit relay. If the relay has completed all measurements, the algorithm refreshes the oldest one. To take this algorithm one step further, I plan to add priorities to the measurements so that the more important measurements are performed more frequently than others.
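The scheduling logic described above can be sketched in a few lines. This is an illustrative reconstruction, not the project's real code; the function name and data shapes are my assumptions.

```python
# Hypothetical sketch of the scheduling algorithm. `required` is the
# full list of measurement types; `completed` maps a measurement type
# to the timestamp of the relay's last run of that test.
def next_measurement(required, completed):
    missing = [m for m in required if m not in completed]
    if missing:
        # The relay hasn't done these yet: assign one of them
        return missing[0]
    # All measurements done: refresh the one with the oldest timestamp
    return min(completed, key=completed.get)
```

Adding priorities later would only change the selection step, e.g. weighting the choice among `missing` (or among stale measurements) by importance.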
|
|
|
|
|
|
Finally, I worked on annotating the data with CAPTCHA Monitor's versions. The main problem was that I hadn't properly defined the versions yet, so I needed to define them first using the merge requests I made. After that, I added the code that attaches the version information to the results.
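The annotation itself amounts to stamping every result with the running version. A minimal sketch, with an assumed placeholder version string (the real value would come from the release tags defined in the merge requests):

```python
# Hypothetical sketch of the version annotation. The version string
# below is an assumed placeholder, not an actual release number.
CM_VERSION = "0.1.0"

def annotate_result(result):
    # Attach the CAPTCHA Monitor version to every stored result so
    # data collected by different versions can be told apart later
    result["captcha_monitor_version"] = CM_VERSION
    return result
```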
|
|
|
|
|
|
This week also marks the completion of a full month of coding, and it has been great so far. I managed to stick to the timeline I set and released a fully working version of the system. I haven't encountered many CAPTCHAs with my system so far. Over the next weeks, I will be working on expanding the modules to track other metrics and to test more websites for CAPTCHAs.
|
|
|
|
|
|
# June 2020
|
|
|
## June 26
|
|
|
This week started with an unexpected issue. The CAPTCHA rates I was getting were very high compared to what Tor Browser users experience in real life. After investigating, I realized that the seleniumwire library I used to capture HTTP headers was causing this issue. Interestingly, this was the case only with Tor; I wasn't getting high CAPTCHA rates when I used seleniumwire over a regular internet connection. Clearly, using seleniumwire and Tor together triggers something on the Cloudflare side. I think they might be detecting the increased latency or the changed TLS fingerprint.
|
|
|
|
|
|
Anyway, I stopped using seleniumwire because it was affecting the results negatively. I started using the [HTTP-Header-Live](https://github.com/Nitrama/HTTP-Header-Live) addon for capturing the headers. The addon starts automatically when the browser starts and captures the headers inside the browser, without touching the traffic itself. When the page is completely loaded, the addon writes the headers to a text file in JSON format. Later, my code reads this file and saves the results. It is not the most elegant way to solve this problem, but I had to use this method since the elegant one (seleniumwire) caused problems.
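The file handoff on the monitor's side is straightforward. Here is a minimal sketch under assumed names; the actual path and the JSON layout produced by the addon may differ.

```python
import json
from pathlib import Path

# Hypothetical sketch of the handoff: HTTP-Header-Live writes the
# captured headers to a JSON text file once the page finishes loading,
# and the monitor reads them back from that file.
def read_captured_headers(path):
    raw = Path(path).read_text()
    return json.loads(raw)
```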
|
|
|
|
|
|
Here is a sample of the code I used to connect Tor Browser to the Tor network via seleniumwire. Feel free to do further testing if this issue sounds interesting to you.
|
|
|
https://gist.github.com/woswos/38b921f0b82de009c12c6494db3f50c5
|
|
|
|
|
|
|
|
|
After solving this unexpected problem, I worked on adding support for older versions of the browsers. Now, the `-b` or `--browser_version` flag can be used to provide the exact browser version. The code doesn't automatically download that version of the browser yet, but that could be a nice future addition.
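Wiring up such a flag with `argparse` looks roughly like this. This is a minimal sketch, not the project's full CLI, and the sample version string is only an illustration.

```python
import argparse

# Minimal sketch of the new flag; the real CLI has more options.
parser = argparse.ArgumentParser(description="CAPTCHA Monitor")
parser.add_argument("-b", "--browser_version",
                    help="exact browser version to use for the test")

# Parsing a sample command line for illustration
args = parser.parse_args(["-b", "68.9.0esr"])
print(args.browser_version)  # → 68.9.0esr
```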
|
|
|
|
Additionally, I created a new section called 'Measurement Search' for showing the
|
|
![82ef3b3f90a738a93042d30a5041a4337479745c_2_1380x876](uploads/b7c5c9d8aed4e9eb8a7d2372e1316d35/82ef3b3f90a738a93042d30a5041a4337479745c_2_1380x876.jpeg)
|
|
|
|
|
|
|
|
|
## June 19
|
|
|
This week I spent my time parallelizing the CAPTCHA Monitor using processes on the host machine. Previously, I was using Docker swarm to replicate the instances of the code, but it turned out to be slow and memory-hungry. Instead, I used Python's *multiprocessing* library to replicate the workers. This required a few changes to the architecture: I separated the code that manages Tor and Tor Browser from the main program loop. Now, the main program loop creates instances of that code in separate processes and makes sure they keep running. Using the updated code, I started collecting data once more. Every day I collect data for a different metric.
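The architecture described above can be sketched with the standard library. This is an illustrative reconstruction under assumed names; the real worker body is the Tor/Tor Browser management code, simulated here by a trivial stand-in.

```python
import multiprocessing as mp

# Rough sketch of the new architecture: the main loop spawns worker
# processes (replacing the Docker swarm replicas) and feeds them jobs
# through a queue. The worker body stands in for the Tor/Tor Browser
# measurement code.
def worker(job_queue, result_queue):
    for job in iter(job_queue.get, None):  # None is the stop signal
        result_queue.put(job * 2)  # placeholder for a real measurement

def run(jobs, n_workers=2):
    job_q, result_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(job_q, result_q))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    for job in jobs:
        job_q.put(job)
    for _ in procs:          # one stop signal per worker
        job_q.put(None)
    results = [result_q.get() for _ in jobs]
    for p in procs:
        p.join()
    return results
```

In the real system the main loop would also watch the worker processes and restart any that die, rather than running a fixed batch and joining.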
|
|
|
|
|
|
The next step was to display the collected data in a dashboard. You might remember that I already mentioned a dashboard and posted a screenshot of it. Actually, that was the second dashboard solution I tried. In the very beginning, I tried [Grafana](https://grafana.com/). It is a really neat open-source dashboard solution with well-designed layout options. These are all great features, but Grafana is geared towards time-series data like the temperature of a CPU or the RAM usage of a computer, so its data sources and backend are designed for that kind of data. It also doesn't provide much flexibility for data manipulation: Grafana wants to display exactly what the database query returns. Unfortunately, I needed more flexibility in the way I process data, and I sometimes needed to combine multiple queries. Still, I used Grafana for a while to see if I was wrong, and I wasn't.
|
I used what I learned from my weeks of dashboard searching to create something simple.
|
|
Finally, I spent some time moving my project to Tor Project's new GitLab server. Previously, the code, issue tracker, and wiki page were all in different locations. Now, they are unified in the same place. GitLab also has a lot of extra productivity tools, and I can't wait to use them. Here is the new home for my code: https://gitlab.torproject.org/woswos/CAPTCHA-Monitor
|
|
|
|
|
|
|
|
|
|
|
|
## June 12
|
|
|
For the first time, I encountered problems with the speed of my code, and I'm glad it happened, because it pushed me to learn how to make the code run faster. As part of my project, I need to perform daily measurements on the Tor exit relays, and there are many of them. I repeat the exact same measurement over and over again and compare the results. At this scale, every extra second in an individual measurement adds roughly 25 minutes to the overall execution time.
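The back-of-the-envelope math behind that figure works out as follows. The relay count here is my own assumption, inferred from the ~25 minute claim (the post itself doesn't state it):

```python
# Sanity check of the scaling claim. Assuming roughly 1,500 exit
# relays, one extra second per measurement adds about 25 minutes
# to a full measurement pass.
exit_relays = 1500            # assumed relay count, not from the post
extra_seconds_per_relay = 1
extra_minutes = exit_relays * extra_seconds_per_relay / 60
print(extra_minutes)  # → 25.0
```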
|
|
|
|
|
|
When I first started, the total execution time was well over 100 hours, and we only have 24 hours in a day. This week I worked on implementing a worker pool to run many operations in parallel. The worker pool system helped me reduce the total execution time significantly (down to 40 hours), but that still isn't enough. Later, I started looking at similar projects like [exitmap](https://github.com/NullHypothesis/exitmap) to see how they handle measurements. This was helpful as well, and I applied what I learned from those projects, but I still need to reduce the total execution time by a lot.
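A worker pool in this style can be sketched with `concurrent.futures` from the standard library. This is an illustrative sketch, not the project's implementation; the measurement itself is simulated by a trivial function.

```python
from concurrent.futures import ProcessPoolExecutor

# Illustrative worker-pool sketch: a fixed number of processes pull
# relays from the input and run the measurement in parallel.
def measure(relay):
    # Stand-in for a real measurement against one exit relay
    return relay, len(relay)

def measure_all(relays, max_workers=4):
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(measure, relays))
```

With a pool of size N, total time drops by roughly a factor of N until the bottleneck (here, the Tor network itself) dominates, which matches the drop from 100+ hours to about 40.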
|