
Add GDPR remover to fetchers

Barkin Simsek requested to merge gdpr into master

This merge request adds GDPR remover code to the base fetcher based on @hackhard's approach from https://github.com/Hackhard/Fetcher/blob/main/analyze_modify.py#L229-L244

This MR is not finalized yet because it needs changes in other modules to work end to end. I didn't want to apply the remover to every fetch job, since that would add time for no benefit. Instead, the worker will decide whether to activate it based on the continent code of the exit relay (this information is already stored in the relay table): if the relay is in Europe, the worker turns the remover on.
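To make the worker-side decision concrete, here is a minimal sketch. The function name `should_remove_gdpr` and the constant are hypothetical, and I'm assuming the relay table stores the usual two-letter GeoIP continent code, where Europe is `"EU"`; adjust if the column uses something else.

```python
# Assumption: continent codes follow the two-letter GeoIP convention ("EU" = Europe).
EUROPE_CONTINENT_CODE = "EU"


def should_remove_gdpr(exit_relay_continent_code):
    """Return True when the GDPR remover should be enabled for a fetch job.

    `exit_relay_continent_code` is the value read from the relay table;
    it may be None when the relay's location is unknown.
    """
    if not exit_relay_continent_code:
        # Unknown location: skip the remover to avoid the extra time.
        return False
    return exit_relay_continent_code.strip().upper() == EUROPE_CONTINENT_CODE
```

The worker would call this once per job and pass the result to the fetcher as a flag, so non-European exits pay no extra cost.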

However, one issue @hackhard faced was websites that redirect to another URL when the accept button is clicked (like Yahoo). A sleep statement with a fixed default value is wasteful and not dynamic enough to wait for these changes, since some websites take longer than others to load completely. I realized that the current_url property of the selenium driver is updated once the page is completely loaded. Thus, after clicking the accept button, we can wait until the URL changes to know that the new page has finished loading.
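A rough sketch of that wait, written as a plain polling loop so it runs without a browser. The helper name `click_and_wait_for_url_change` and its parameters are hypothetical; the only thing it assumes about the driver is the `current_url` property. With a real driver, Selenium's `WebDriverWait` combined with `expected_conditions.url_changes` should give the same behavior.

```python
import time


def click_and_wait_for_url_change(driver, click, timeout=10.0, poll_interval=0.25):
    """Run `click`, then poll `driver.current_url` until it differs from the old URL.

    `click` is any zero-argument callable that presses the accept button.
    Returns True if the URL changed before `timeout` seconds passed, else False.
    """
    old_url = driver.current_url  # remember where we were before clicking
    click()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if driver.current_url != old_url:
            return True  # redirect finished
        time.sleep(poll_interval)
    return False  # no redirect observed within the timeout
```

The timeout is still needed as an upper bound, but pages that redirect quickly no longer pay for a fixed sleep.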

Now, to make the URL-change wait work, we need to know which domains behave this way (redirecting to another URL after the GDPR prompt is accepted), because we can't wait for the URL to change on a site where it is never going to change. I think we need another utility to determine this when domains are added to the database.

I can keep working on making these details happen. What do you think @hackhard? I wanted to get your thoughts.

Edited by hackhard
