Add GDPR remover to fetchers
This merge request adds GDPR remover code to the base fetcher based on @hackhard's approach from https://github.com/Hackhard/Fetcher/blob/main/analyze_modify.py#L229-L244
This MR is not finalized yet because it needs changes in other modules to work end to end. I didn't want to apply the remover to every fetch job, since that would add time for no benefit. Instead, the worker will decide whether to activate this functionality based on the continent code of the exit relay (this information is already stored in the relay table). If the relay is in Europe, the worker will turn the functionality on.
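To make the worker-side decision concrete, here is a minimal sketch. The function name `should_remove_gdpr` and the `"EU"` value are assumptions for illustration; the actual code stored in the relay table may be formatted differently.

```python
# Hypothetical helper: the name and the "EU" code are assumptions,
# not taken from the existing worker or relay table schema.
EUROPE_CONTINENT_CODE = "EU"


def should_remove_gdpr(continent_code):
    """Return True when the exit relay's continent code means Europe."""
    if continent_code is None:
        # Unknown location: keep the remover off so the fetch job stays fast.
        return False
    return continent_code.strip().upper() == EUROPE_CONTINENT_CODE
```

The worker could call this once per fetch job with the continent code it reads for the exit relay, and pass the result to the fetcher as a flag.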
However, one of the issues @hackhard faced was websites that redirect to another URL when the accept button is clicked (like Yahoo). A sleep statement with a fixed default value is wasteful and not dynamic enough to wait for these redirects, since some websites take longer to load the page completely. I realized that the current_url property of the Selenium driver is updated once the page is completely loaded. Thus, after clicking the accept button, we can wait until the URL changes to know that the new page has loaded.
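A rough sketch of waiting on the URL instead of sleeping for a fixed time. `wait_for_url_change` and its parameters are hypothetical names, and the only thing it assumes about `driver` is that it exposes `current_url`, as the Selenium driver does.

```python
import time


def wait_for_url_change(driver, old_url, timeout=10.0, poll_interval=0.25):
    """Poll driver.current_url until it differs from old_url.

    Returns True if the URL changed before the timeout, False otherwise.
    Hypothetical helper: the name and default values are assumptions.
    """
    deadline = time.monotonic() + timeout
    while True:
        if driver.current_url != old_url:
            return True
        if time.monotonic() >= deadline:
            # Give up instead of blocking the fetch job indefinitely.
            return False
        time.sleep(poll_interval)
```

The fetcher would record `driver.current_url` before clicking the accept button and call this right after the click. Selenium also ships `WebDriverWait` together with `expected_conditions.url_changes`, which covers the same idea and raises a timeout exception instead of returning False.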
Now, to make this work, we need to know which domains have this behavior (redirecting to another URL after the GDPR prompt is accepted), because we can't wait for the URL to change if it is not supposed to change. I think we need another utility to determine this when domains are added to the database.
I can keep working on making these details happen. What do you think @hackhard? I wanted to get your thoughts.