From main/docs/operations.md, I found this extremely interesting:
3330 unique two-label .onion domains were configured from 26937 unique sites. 13956 of those unique sites have the same Onion-Location configuration as Twitter, which likely means that they copied some of their HTML attributes.
I wondered if these sites were clones/phishing attempts of Twitter.
So, I tried to open a couple and I got redirected to https://x.com/someusername.
I had already seen this pattern: some people use a subdomain to redirect to their Twitter/X account.
So, is the scraper following these redirects and then associating them to the original URL?
If so, personally I think these cases should be ignored instead (or considered only if there's also an explicit Onion-Location header).
Short answer: yes, redirects are followed. And I agree with you that
onion-grab should not associate the initial URL's domain with an onion
address found via Onion-Location on a different (redirected-to) domain.
I consider the fact that we follow redirects while associating any found
onion address from the redirected domain a bug. In hindsight (and after
thinking about this for a few minutes -- warning), it seems like our data
collection should have been based on rsp.Request.URL; not reqURL.
Luckily this doesn't affect our estimate on the number of onion addresses that
can be found via CT + Onion Location (3330) as we dedup; and it also doesn't
affect any of the follow-up measurements we did based on the dataset.
My current thinking of what needs to be done now:
1. Update the operations timeline to reflect that you found this bug and what
   the impact is. And also call out this bug in the TL;DR we have at the top.
2. Leave the dataset as is, since it's already been collected using a
   particular version of the software which has the undesired redirect
   behavior.
3. Fix the bug and create a new tag, so that any future measurements are not
   affected in the same way. Feedback on whether you think it sounds sensible
   to base the domain on rsp.Request.URL would be much welcome. Or if you
   think it would be better behavior to not follow redirects. If we keep the
   redirects but attribute to the right domain name, I'd first want to test
   that such behavior won't introduce other bugs. For example, I'm unsure what
   will be in the HTTP headers if the redirecting site sets Onion-Location and
   the site it redirects to sets Onion-Location as well. So, I will need to do
   a bit of testing/debugging if we're keeping redirects.
Many thanks for looking at the dataset and spotting this! I won't be able to
make the above fixes right away due to attending PETS, but I'll take a stab at
making the appropriate updates and fixes when I'm back home sometime next week.
> Leave the dataset as is, since it's already been collected using a particular version of the software which has the undesired redirect behavior.
Yep, I think that you can keep the current one around (in case someone used it for research and needs to use the same dataset again), but suggest that people switch to a new one.
An updated version of the dataset would be great for future tasks, but anyone needing to quickly fix this could start from the current version of the dataset and only check whether the sites that have already been grabbed perform a redirect.
> Feedback on if you think it sounds sensible to base the domain on rsp.Request.URL would be much welcome.
I think it's what the current consumers of Onion-Location will eventually do, so it's a good idea.
The idea of also checking the redirects is very interesting, but I don't know if it's worth it.
It seems to me a very specific case in a world that is already small.
I initially thought it's something that might happen, for example, with our build machines (tb-build-*.tpo, which redirect to www.tpo), but I noticed they don't actually have an onion site of their own.
> Many thanks for looking at the dataset and spotting this! I won't be able to make the above fixes right away due to attending PETS, but I'll take a stab at making the appropriate updates and fixes when I'm back home sometime next week.
No worries!
I was curious to see some data for tpo/applications/tor-browser#42688 (orders of magnitude were enough for my purposes, so I didn't really write a parser for your dataset).
Basically, I didn't know until very recently that Onion-Location could be triggered with meta tags, and I wondered if this option was actually used (and possibly whether to propose removing it to avoid HTML injection).
Excluding all the results containing twitter was enough for me not to consider the idea of dropping meta tags anymore.
I'm extending the paper's ACK section with the following (the part in bold):
We would also like to thank the Tor Project, especially [...]; as well as Pier Angelo Vendrame for reporting a bug in how onion-grab noted down results for sites with HTTP redirects to other O-L sites.
The paper is not yet public, so please let me know in the coming week or so if you would rather not be acknowledged or if you don't like the wording @pierov. Thanks again for reporting (I'm backlogged on providing a new tag with a fix, but on it now-ish).