There are companies - such as CloudFlare - which are effectively now Global Active Adversaries. Using CF as an example - they do not appear open to working together in an open dialog, they actively make it nearly impossible to browse to certain websites, they collude with larger surveillance companies (like Google), their CAPTCHAs are awful, they block members of our community on social media rather than engaging with them, and frankly, they run untrusted code in millions of browsers on the web for questionable security gains.
It would be great if they allowed GET requests - for example - such requests should not, and generally do not, modify server-side content. They do not do this, and it breaks the web in so many ways it is incredible. Using wget with Tor on a website hosted by CF is... a disaster. Using Tor Browser with it is much the same. These requests are supposed to be safe and idempotent according to the HTTP spec.
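As a purely illustrative sketch of that "don't challenge reads" rule - the proxy hook, cache, and reputation inputs below are hypothetical stand-ins, not CloudFlare's actual pipeline - the decision could look like this:

```python
# Illustrative challenge decision for a CF-style edge: never challenge
# safe (read-only) methods; only state-changing requests from
# low-reputation IPs get a CAPTCHA. All names here are hypothetical.

SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}   # RFC 7231: safe methods

def handle_request(method, url, ip_reputation, cache, threshold=0.8):
    if method in SAFE_METHODS:
        # Read-only traffic passes (or gets a cached copy under load),
        # with no CAPTCHA and no tracking cookie required.
        return cache.get(url, f"FETCH {url} from origin")
    if ip_reputation > threshold:
        # Only unsafe methods from suspect IPs are interrupted.
        return "SERVE challenge page, then retry the request"
    return f"FORWARD {method} {url} to origin"

# Example: a Tor exit with terrible reputation can still *read* the page.
print(handle_request("GET", "/article", ip_reputation=0.99, cache={}))
print(handle_request("POST", "/comment", ip_reputation=0.99, cache={}))
```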
I would like to find a solution with Cloudflare - but I'm unclear that the correct answer is to create a single cookie that is shared across all sessions - this effectively links all browsing for the web. When tied with Google, it seems like a basic analytics problem to enumerate users and most sites visited in a given session.
One way - I think - would be to create a warning page upon detection of a CF edge or CAPTCHA challenge. This could be similar to an SSL/TLS warning dialog, with options for users to bypass it, engage with CF's systems, contact CF or the site's owners, or view a cached, read-only version of the website from archive.org, archive.is or other caching systems. That would ensure that millions of users would be able to engage with informed consent before they're tagged, tracked and potentially deanonymized. TBB can protect against some of this, of course, but when all your edge nodes are run by one organization that can see plaintext, IP addresses, identifiers and so on, the protection is reduced. It is an open research question how badly it is reduced, but intuitively I think there is a reduction in anonymity.
It would be great to find a solution that allows TBB users to use the web without changes on our end - where they can solve one CAPTCHA if required - perhaps not even prompting on GET requests, for example. Though in any case, I think we have to consider that there is a giant amount of data at CF, and we should ensure that it does not harm end users. I believe CF would share this goal if we explain that we're all interested in protecting users - both those hosting and those using the websites.
Some open questions:
What kind of per browser session tracking is actually happening?
What other options do we have on the TBB side?
What would a reasonable solution look like for a company like Cloudflare?
What is reasonable for a user to do? (~17 CAPTCHAs for one site == not reasonable)
Would "Warning this site is under surveillance by Cloudflare" be a reasonable warning or should we make it more general?
Disclaimer: I work for CloudFlare. Comments here are my own opinions, not my employer's.
I will restrain myself and not comment on the political issues Jacob raised. I'll keep it technical.
I would like to find a solution with Cloudflare - but I'm unclear that the correct answer is to create a single cookie that is shared across all sessions - this effectively links all browsing for the web.
A thousand times yes. I raised this option a couple of times (supercookie) and we agreed this is a bad idea. I believe there is a cryptographic solution to this. I'm not a crypto expert, so I'll allow others to explain this. Let's define the problem:
There are CDN/DDoS companies on the internet that provide spam protection for their customers. To do this they use CAPTCHAs to prove that the visitor is a human. Some companies provide protection to many websites, so a visitor from an abusive IP address will need to solve a CAPTCHA on each and every protected domain. Let's assume the CDN/DDoS company doesn't want to be able to correlate users visiting multiple domains. Is it possible to prove that a visitor is indeed human, once, but not allow the CDN/DDoS company to deanonymize / correlate the traffic across many domains?
In other words: is it possible to provide a bit of data (i'm-a-human) tied to the browsing session while not violating anonymity?
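One concrete shape an answer could take - what later comments in this thread call "blinded tokens" - is: solve one CAPTCHA, receive blindly signed single-use tokens, and later spend one unlinkable token per site instead of solving a fresh CAPTCHA. Below is a minimal sketch using textbook RSA blind signatures in pure Python; the key size, the single-issuer model, and the absence of padding are simplifying assumptions for illustration, not a production protocol:

```python
# Minimal blinded-token sketch (textbook RSA blind signatures, pure
# Python): the user solves ONE CAPTCHA, gets a token signed blind, and
# redeems it later; the issuer cannot link issuance to redemption.
import hashlib
import secrets
from math import gcd

def is_probable_prime(n, rounds=40):
    # Miller-Rabin primality test.
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:
        d, r = d // 2, r + 1
    for _ in range(rounds):
        a = secrets.randbelow(n - 3) + 2
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def gen_prime(bits):
    while True:
        c = secrets.randbits(bits) | (1 << (bits - 1)) | 1
        if is_probable_prime(c):
            return c

# --- Issuer (the CDN) generates an RSA signing key ---
e = 65537
while True:
    p, q = gen_prime(512), gen_prime(512)
    phi = (p - 1) * (q - 1)
    if gcd(e, phi) == 1:
        break
n = p * q
d = pow(e, -1, phi)      # private signing exponent
spent = set()            # double-spend ledger of redeemed token hashes

# --- Client: after solving one CAPTCHA, blinds a fresh random token ---
token = secrets.token_bytes(32)
m = int.from_bytes(hashlib.sha256(token).digest(), "big")
while True:
    r = secrets.randbelow(n - 2) + 2
    if gcd(r, n) == 1:
        break
blinded = (m * pow(r, e, n)) % n      # the issuer sees only this value

# --- Issuer signs blind (it learns nothing about `token` itself) ---
blind_sig = pow(blinded, d, n)

# --- Client unblinds; (token, sig) is unlinkable to the issuance ---
sig = (blind_sig * pow(r, -1, n)) % n

# --- Any protected site: redeem a token instead of showing a CAPTCHA ---
def redeem(token, sig):
    m = int.from_bytes(hashlib.sha256(token).digest(), "big")
    if m in spent or pow(sig, e, n) != m % n:
        return False
    spent.add(m)
    return True

assert redeem(token, sig)        # first redemption passes the edge
assert not redeem(token, sig)    # replay of the same token is refused
```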
Disclaimer: I work for CloudFlare. Comments here are my own opinions, not my employer's.
Could you please ask your employer or other coworkers to come and talk with us openly? Many members of our community, some of whom are also your (server-side) users, are extremely frustrated. It is in the best interest of everyone to help find a solution for those users.
I will restrain myself and not comment on the political issues Jacob raised. I'll keep it technical.
What specifically is political versus technical? That CF is now a GAA? That CF does indeed gather metrics? That CF does run code untrusted (by me, or other users) in our browsers? That your metrics count as a kind of surveillance that is seemingly linked with a PRISM provider?
I would like to find a solution with Cloudflare - but I'm unclear that the correct answer is to create a single cookie that is shared across all sessions - this effectively links all browsing for the web.
A thousand times yes. I raised this option a couple of times (supercookie) and we agreed this is a bad idea.
What is the difference between one super cookie and ~1m cookies on a per site basis? The anonymity set appears to be strictly worse. Or do you guys not do any stats on the backend? Do you claim that you can't and don't link these things?
I believe there is a cryptographic solution to this. I'm not a crypto expert, so I'll allow others to explain this. Let's define a problem:
There are CDN/DDoS companies on the internet that provide spam protection for their customers. To do this they use CAPTCHAs to prove that the visitor is a human. Some companies provide protection to many websites, so a visitor from an abusive IP address will need to solve a CAPTCHA on each and every protected domain. Let's assume the CDN/DDoS company doesn't want to be able to correlate users visiting multiple domains. Is it possible to prove that a visitor is indeed human, once, but not allow the CDN/DDoS company to deanonymize / correlate the traffic across many domains?
Here is a non-cryptographic, non-cookie based solution: Never prompt for a CAPTCHA on GET requests.
For such a user - how will you protect any information you've collected from them? Will that information be of higher value or richer technical information if there is a cookie (super, regular, whatever) tied to that data?
In other words: is it possible to provide a bit of data (i'm-a-human) tied to the browsing session while not violating anonymity?
This feels like a trick question - behavioral analysis in itself reduces the anonymity set by adding at least one bit of information. My guess is that it is a great deal more than a single bit - especially over time - and each additional bit of distinguishing information halves the set of users one could plausibly be.
There are CDN/DDoS companies on the internet that provide spam protection for their customers. To do this they use CAPTCHAs to prove that the visitor is a human. Some companies provide protection to many websites, so a visitor from an abusive IP address will need to solve a CAPTCHA on each and every protected domain. Let's assume the CDN/DDoS company doesn't want to be able to correlate users visiting multiple domains. Is it possible to prove that a visitor is indeed human, once, but not allow the CDN/DDoS company to deanonymize / correlate the traffic across many domains?
In other words: is it possible to provide a bit of data (i'm-a-human) tied to the browsing session while not violating anonymity?
This sounds very much like something that could be provided through the use of zero-knowledge proofs. It doesn't seem clear to me that being able to say "this is an instance of Tor which has already answered a bunch of CAPTCHAs" is actually useful. I think the main problem with CAPTCHAs at this point is that robots are just about as good at answering them as humans. Apparently robots are worse than humans at building up tracked browser histories. That seems like a harder property for a Tor user to prove.
What sort of data would qualify as an 'i'm a human' bit?
What sort of data would qualify as an 'i'm a human' bit?
I don't think DDoS should be based on identifying humans.
Bots are legitimate consumers of data as well, and in the future they might even be more intelligent than most humans today, so we might as well design our systems to be friendly for them.
DDoS is a supply/demand type of economic issue and any solutions should treat it as such.
Ultimately, I wonder if the point is simply to identify people - across browser sessions, across proxies, across Tor exits - and the "I'm a human" bit is just the start. I wonder where that ends?
In a sense, I feel like this CF issue is like a giant Wifi Captive Portal for the web. It shims in some kind of "authentication" in a way that breaks many existing protocols and applications.
If I was logged into Google (as they use a Google CAPTCHA...), could they vouch for my account and auto-solve it? Effectively creating an ID system for the entire web, where CF is the MITM for all the users visiting sites cached/terminated by them? I think - yes to both - and that is concerning.
Disclaimer: I work for CloudFlare. Comments here are my own opinions, not my employer's.
I will restrain myself and not comment on the political issues Jacob raised. I'll keep it technical.
I would like to find a solution with Cloudflare - but I'm unclear that the correct answer is to create a single cookie that is shared across all sessions - this effectively links all browsing for the web.
A thousand times yes. I raised this option a couple of times (supercookie) and we agreed this is a bad idea. I believe there is a cryptographic solution to this. I'm not a crypto expert, so I'll allow others to explain this. Let's define the problem:
There are CDN/DDoS companies on the internet that provide spam protection for their customers. To do this they use CAPTCHAs to prove that the visitor is a human. Some companies provide protection to many websites, so a visitor from an abusive IP address will need to solve a CAPTCHA on each and every protected domain. Let's assume the CDN/DDoS company doesn't want to be able to correlate users visiting multiple domains. Is it possible to prove that a visitor is indeed human, once, but not allow the CDN/DDoS company to deanonymize / correlate the traffic across many domains?
In other words: is it possible to provide a bit of data (i'm-a-human) tied to the browsing session while not violating anonymity?
Yes. This is a problem that "Anonymous Credential" systems are designed to solve. An example of a system with most of the properties that are desired is presented in Au, M. H., Kapadia, A., Susilo, W., "BLACR: TTP-Free Blacklistable Anonymous Credentials with Reputation" (https://www.cs.indiana.edu/~kapadia/papers/blacr-ndss-draft.pdf). Note that this is still an active research area, and BLACR in and of itself may not be practical/feasible to implement; it is listed only as an example since the paper gives a good overview of the problem and how this kind of primitive can be used to solve it.
Isis can go into more details on this sort of thing, since she was trying to implement a similar thing based on Mozilla Persona (aborted attempt due to Mozilla Persona being crap).
@ioerror: you are doing this again. You are mixing your opinions with technical reality. Please stop insulting me. Please focus on what we can technically do to fix the problem.
Here is a non-cryptographic, non-cookie based solution: Never prompt for a CAPTCHA on GET requests.
There are a number of problems with this model.
(POST is hard) First, what should the proxy actually do on a POST? Abort your POST, serve a CAPTCHA, and ask you to fill in the POST again? Or accept your 10-meg upload, serve a CAPTCHA and ask you to upload it again? Now think about proxy behaviour during an attack. Doing CAPTCHA validation on POST is not a trivial thing.
(Blocking regions) Second, during an "attack" (call it DDoS or something) website owners often decide to block traffic from certain regions. Many businesses care only about visitors from some geographical region, and in case of a DDoS are happy to just DROP traffic from other regions. This is not something to like or dislike; it is a reality for many website owners. Serving a CAPTCHA is strictly better than disallowing the traffic unconditionally.
(Not only spam, load as well) Third, there regularly are bot "attacks" that just spam a website with a continuous flood of GET requests, for example to check if the offered product is released, the promotion has started or the price updated. This is a problem for some website owners and they wish to allow only traffic from vetted sessions.
The underlying problem is that for any DDoS / spam protection system the source IP address is a very strong signal. Unfortunately many Tor exit IPs have a bad IP reputation, because they ARE often used for unwanted activity.
Here is a non-cryptographic, non-cookie based solution: Never prompt for a CAPTCHA on GET requests.
There are a number of problems with this model.
(POST is hard) First, what should the proxy actually do on a POST? Abort your POST, serve a CAPTCHA, and ask you to fill in the POST again? Or accept your 10-meg upload, serve a CAPTCHA and ask you to upload it again? Now think about proxy behaviour during an attack. Doing CAPTCHA validation on POST is not a trivial thing.
CloudFlare is in a position to inject JavaScript into sites. Why not hook requests that would result in a POST and challenge after, say, clicking the submit button?
What sort of data would qualify as an 'i'm a human' bit?
Let's start with something not-worse than now: a CAPTCHA solved in the last few minutes.
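For illustration, that "solved recently" bit could be carried in a short-lived signed cookie that encodes only a timestamp and is keyed per site, so it never becomes the supercookie rejected above. A minimal sketch, with hypothetical key handling, cookie format and TTL:

```python
# Sketch of a short-lived "solved a CAPTCHA recently" clearance cookie:
# an HMAC over a timestamp with a per-site key, so it carries one bit
# ("solved recently") rather than a cross-site identity.
import hashlib
import hmac
import time

SITE_KEYS = {"example.com": b"per-site-secret"}   # one key per protected site
TTL_SECONDS = 300                                 # "solved in the last few minutes"

def issue_clearance(site, now=None):
    ts = str(int(now if now is not None else time.time()))
    mac = hmac.new(SITE_KEYS[site], f"{site}|{ts}".encode(), hashlib.sha256)
    return f"{ts}.{mac.hexdigest()}"

def check_clearance(site, cookie, now=None):
    try:
        ts, mac_hex = cookie.split(".", 1)
    except ValueError:
        return False
    expected = hmac.new(SITE_KEYS[site], f"{site}|{ts}".encode(), hashlib.sha256)
    if not hmac.compare_digest(expected.hexdigest(), mac_hex):
        return False
    age = (now if now is not None else time.time()) - int(ts)
    return 0 <= age <= TTL_SECONDS

cookie = issue_clearance("example.com")
assert check_clearance("example.com", cookie)                              # fresh
assert not check_clearance("example.com", cookie, now=time.time() + 3600)  # expired
```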
Is this something that CloudFlare has actually found effective? Are there metrics on how many challenged requests that successfully solved a CAPTCHA turned out to actually be malicious?
CloudFlare is in a position to inject JavaScript into sites
This alone should be reason enough for the security warning. People might be viewing sites which they believe to be in a different jurisdiction and suddenly giving control to a US entity.
To quantify the scope of the problem slightly, a few weeks ago I measured that 10% of the Alexa top 25k are behind Cloudflare.
It would be helpful if we had a nice, well written, easy to understand explanation of the problem that we could give to site owners. Of those that I have contacted, some get it and adjust things quickly, but some struggle to understand what the problem is.
@ioerror: you are doing this again. You are mixing your opinions with technical reality. Please stop insulting me. Please focus on what we can technically do to fix the problem.
I'm unclear on what I've said or done that insults you - could you clarify? It certainly isn't my intent to insult you.
What is my opinion and what is technical reality? Could you enumerate that a bit? I've asked many questions and it is important that we discuss the wide range of topics here.
Here is a non-cryptographic, non-cookie based solution: Never prompt for a CAPTCHA on GET requests.
There are a number of problems with this model.
There are a number of problems with the current model - to be clear - and so while there are downsides to the read-only GET suggestion, I think it would reduce nearly all complaints by end users.
(POST is hard) First, what should the proxy actually do on a POST? Abort your POST, serve a CAPTCHA, and ask you to fill in the POST again? Or accept your 10-meg upload, serve a CAPTCHA and ask you to upload it again? Now think about proxy behaviour during an attack. Doing CAPTCHA validation on POST is not a trivial thing.
Off the top of my head - to ensure I reply to everything you've written:
It seems reasonable in many cases to redirect them on pages where this is a relevant concern? POST fails, the failure page asks for a CAPTCHA solution, etc.
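A sketch of how that redirect flow could work without losing the user's submission - buffer the POST body, hand back a one-time ticket with the CAPTCHA, and replay the buffered POST once the ticket is redeemed. The names and the unbounded in-memory store are illustrative; a real edge would cap sizes and expire entries:

```python
# Sketch of challenging a POST without discarding the submission.
import secrets

pending_posts = {}   # ticket -> (url, body)

def on_post(url, body):
    ticket = secrets.token_urlsafe(16)
    pending_posts[ticket] = (url, body)
    return f"303 See Other -> /challenge?ticket={ticket}"

def on_challenge_solved(ticket):
    if ticket not in pending_posts:
        return "410 Gone (ticket expired, please resubmit)"
    url, body = pending_posts.pop(ticket)
    return f"REPLAY POST {url} ({len(body)} bytes) to origin"

resp = on_post("/comment", b"name=alice&text=hello")
ticket = resp.split("ticket=")[1]
print(on_challenge_solved(ticket))   # replays the buffered POST once
print(on_challenge_solved(ticket))   # second redemption fails
```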
(Blocking regions) Second, during an "attack" (call it DDoS or something) website owners often decide to block traffic from certain regions. Many businesses care only about visitors from some geographical region, and in case of a DDoS are happy to just DROP traffic from other regions. This is not something to like or dislike; it is a reality for many website owners. Serving a CAPTCHA is strictly better than disallowing the traffic unconditionally.
Actually, a censorship page with specific information a la HTTP 451 would be a nearly in-spec answer to this problem. Why not use that? You're performing geographic discrimination on behalf of your users - this censorship should be transparent. It should be clear that the site owner has decided to do this - and there is less of a need to solve a CAPTCHA by default.
Though in the case of Tor - you can't do this properly - which is a reason to treat Tor users as special. Visitors may be in the region and Tor is properly hiding them. That is a point in the direction of having an interstitial page that allows a user to solve a CAPTCHA.
(Not only spam, load as well) Third, there regularly are bot "attacks" that just spam a website with a continuous flood of GET requests, for example to check if the offered product is released, the promotion has started or the price updated. This is a problem for some website owners and they wish to allow only traffic from vetted sessions.
Why not just serve them an older cached copy?
The underlying problem is that for any DDoS / spam protection system the source IP address is a very strong signal. Unfortunately many Tor exit IPs have a bad IP reputation, because they ARE often used for unwanted activity.
What sort of data would qualify as an 'i'm a human' bit?
Let's start with something not-worse than now: a CAPTCHA solved in the last few minutes.
This feels circular - one of the big problems is that users are unable to solve them after a dozen tries. We would not have as many complaining users if we could get this far, I think.
This sounds very much like something that could be provided through the use of zero-knowledge proofs
Yup. What do we do to implement one, both on the DDoS protection side and on the TBB side?
My first-order proposition would be to serve a cached copy of the site in "read only" mode with no changes on the TBB side. We can get this from other third parties if CF doesn't want to serve it directly - that was part of my initial suggestion. Why not just serve that data directly?
Maybe CloudFlare could be persuaded to use CAPTCHAs more precisely?
That is, present a CAPTCHA only when:
the server owner has specifically requested that CAPTCHAs be used
the server is actively under DoS attack, and
the client's IP address is currently a source of the DoS.
I think it's hugely overkill to show CAPTCHAs all the time to all Tor users for every CloudFlare site. It's also unreasonable to maintain a "reputation" for a Tor exit node.
On top of this, Google's reCAPTCHA is buggy and frequently impossible to solve. Has CloudFlare considered other CAPTCHAs, or discussed reCAPTCHA's problems with Google?
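To make that proposal concrete, here is a minimal sketch of the three-condition gating rule; the config fields and attack-detection inputs are hypothetical placeholders, not anything CloudFlare has described:

```python
# Sketch of the narrower gating rule proposed above: challenge only when
# the site owner opted in, an attack is in progress, and this client IP
# is participating in it. All field names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class SiteConfig:
    captchas_requested: bool = False        # owner explicitly asked for CAPTCHAs

@dataclass
class AttackState:
    under_dos: bool = False                 # live detection at the edge
    attacking_ips: set = field(default_factory=set)

def should_challenge(client_ip, site, attack):
    return (site.captchas_requested
            and attack.under_dos
            and client_ip in attack.attacking_ips)

# A Tor exit that is not part of the current attack is left alone:
site = SiteConfig(captchas_requested=True)
attack = AttackState(under_dos=True, attacking_ips={"198.51.100.7"})
assert should_challenge("198.51.100.7", site, attack)
assert not should_challenge("192.0.2.1", site, attack)
```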
Maybe CloudFlare could be persuaded to use CAPTCHAs more precisely?
That is, present a CAPTCHA only when:
the server owner has specifically requested that CAPTCHAs be used
the server is actively under DoS attack, and
the client's IP address is currently a source of the DoS.
That seems interesting - I wish we had data to understand if these choices would help - it seems opaque how "threat scores" for IP addresses are computed. Is there any public information about it?
I think it's hugely overkill to show CAPTCHAs all the time to all Tor users for every CloudFlare site. It's also unreasonable to maintain a "reputation" for a Tor exit node.
I agree.
On top of this, Google's reCAPTCHA is buggy and frequently impossible to solve. Has CloudFlare considered other CAPTCHAs, or discussed reCAPTCHA's problems with Google?
I'm also interested in understanding the dataflow - could the FBI go to Google to get data on all CloudFlare users? Does CF protect it? If so - who protects users more?
While we do provide a feature that caches old versions of sites (called Always Online), it is not enabled by default. And even if it were, you can imagine site owners disabling it. Furthermore it is totally possible for the URL to not be in cache. Fundamentally Always Online solves a different problem - serving content in the event of the origin being unavailable. This is different from protecting the origin - you want to serve a challenge to bots, not content.
I'll add one more aspect here - in some large attacks we struggle to even serve CAPTCHAs. The bots request them over and over again, which generates big traffic. The CAPTCHA page is optimised for size. We certainly don't want to serve larger pages to suspected-bad IP addresses; we need to shield our own servers as well.
Do you have any open data on this?
No, but the bad IP reputation for Tor exits is not generated by rolling dice.
This feels circular - one of the big problems is that users are unable to solve them after a dozen tries
arthuredelstein: On top of this, Google's reCAPTCHA is buggy and frequently impossible to solve.
Maybe this is the problem. But here is the thing - reCAPTCHA gives different challenges to different IP addresses. Maybe the Google IP reputation of Tor exits is so bad that they really don't want this traffic.
Ok, let me try to put the discussion on track again.
I would be very interested in getting the zero-knowledge proofs working. That is - require a TBB user to prove they're human exactly once, and then reuse this data across the browsing session, without losing anonymity. This is not a CloudFlare-specific idea; there are many other providers using CAPTCHAs. We could have a generic technology for proving "i'm-a-human".
We could have a generic technology for proving "i'm-a-human".
What does attempting to prove "i'm-a-human" have to do with addressing DDoS attacks?
Bots are legitimate consumers of data (as stated above).
Just because something is not human, does not mean you should treat it specially. I thought you were trying to prevent DDoS attacks, not play the Turing Test.
Ok, let me try to put the discussion on track again.
I would be very interested in getting the zero-knowledge proofs working. That is - require a TBB user to prove they're human exactly once, and then reuse this data across the browsing session, without losing anonymity. This is not a CloudFlare-specific idea; there are many other providers using CAPTCHAs. We could have a generic technology for proving "i'm-a-human".
Building the infrastructure for a zero-knowledge proof system sounds like a fascinating but expensive and long-term project. And I wouldn't be confident that CloudFlare would even adopt such a thing once it became available, unless they made a significant investment in the work at the beginning.
Personally I am more interested in what near-term adjustments CloudFlare could make to reduce the CAPTCHA burden on Tor users, which seems to be unnecessarily high. Marek, do you have any thoughts about my suggestions for reducing CAPTCHA use in comment:17?
There are companies - such as CloudFlare - which are effectively now Global Active Adversaries.
That's an inflammatory introduction. We are not adversarial to Tor as an entity; we are trying to deal with abuse that uses the Tor network. It's inevitable that a system providing anonymity gets abused (as well as used). I'm old enough to remember the trials and tribulations of the Penet remailer and spent a long time working in antispam.
Using CF as an example - they do not appear open to working together in an open dialog,
Really? We've had multiple contacts with people working on Tor through events like Real World Crypto, and have been trying to come up with a solution that will protect web sites from malicious use of Tor while protecting the anonymity of Tor users (such as myself). We rolled out special handling of the Tor network so that users should not see a CAPTCHA on a circuit change. We also changed the CAPTCHA to the new one, since the old one was serving very hard to handle text CAPTCHAs to Tor users. The crypto guys who work for me are interested in blinded tokens as a way to solve both the abuse problem and preserve anonymity.
Earlier @ioerror asked if there was open data on abuse from Tor exit nodes. In 2014 I wrote a small program called "torhoney" that pulls the list of exit nodes and matches it against data from Project Honeypot about abuse. That code is here: https://github.com/jgrahamc/torhoney. You can run it and see the mapping between an exit node and its Project Honeypot score to get a sense for abuse from the exit nodes.
I ran the program today and have data on 1,057 exit nodes showing that Project Honeypot marks 710 of them as a source of comment spam (67%), with 567 having a score of greater than 25 (in Project Honeypot terminology, meaning the node delivered at least 100 spam messages) (54%). Over time these values have been trending upwards. I've been recording the Project Honeypot data for about 13 months; the percentage of exit nodes listed as a source of comment spam was about 45% a year ago and is now around 65%.
So, I'm interested in hearing about technical ways to resolve these problems. Are there ways to reduce the amount of abuse through Tor? Could Tor Browser implement a blinded token scheme that would preserve anonymity and allow a Turing Test?
Sometimes the problem CF seems to be worried about is DDoS; sometimes it is comment spam. Those are typically very different things and are protected against in very different ways. Indeed it is quite hard to use Tor to do many of the more common amplification-style DDoS techniques. Can we please try not to muddy the waters by having an ambiguous threat model?
There are companies - such as CloudFlare - which are effectively now Global Active Adversaries.
That's an inflammatory introduction. We are not adversarial to Tor as an entity; we are trying to deal with abuse that uses the Tor network. It's inevitable that a system providing anonymity gets abused (as well as used). I'm old enough to remember the trials and tribulations of the Penet remailer and spent a long time working in antispam.
Sometimes the problem CF seems to be worried about is DDoS; sometimes it is comment spam. Those are typically very different things and are protected against in very different ways. Indeed it is quite hard to use Tor to do many of the more common amplification-style DDoS techniques. Can we please try not to muddy the waters by having an ambiguous threat model?
I was giving the example of comment spamming because Project Honeypot is a third party. It gives you an idea of what's happening through Tor. Comment spam is something we deal with, along with DDoS attacks and hacking of web sites (SQL injection etc.). Different techniques are used for different attack types.
Using CF as an example - they do not appear open to working together in an open dialog,
Really? We've had multiple contacts with people working on Tor through events like Real World Crypto, and have been trying to come up with a solution that will protect web sites from malicious use of Tor while protecting the anonymity of Tor users (such as myself).
Yes, really.
We rolled out special handling of the Tor network so that users should not see a CAPTCHA on a circuit change.
This has never worked, and I say that as someone who uses the Tor Browser Bundle every day and has for years.
We also changed the CAPTCHA to the new one, since the old one was serving very hard to handle text CAPTCHAs to Tor users.
You should know that the CAPTCHA still works about 1 in 20 times in my experience, and that didn't change at all after you switched to the "new one."
The crypto guys who work for me are interested in blinded tokens as a way to solve both the abuse problem and preserve anonymity.
That's a nice thought, but you're still completely censoring my use of your customer's websites 95% of the time all day every day and wasting my time during the 5% of times your system 'works'.
I ran the program today and have data on 1,057 exit nodes showing that Project Honeypot marks 710 of them as a source of comment spam (67%), with 567 having a score of greater than 25 (in Project Honeypot terminology, meaning the node delivered at least 100 spam messages) (54%). Over time these values have been trending upwards. I've been recording the Project Honeypot data for about 13 months; the percentage of exit nodes listed as a source of comment spam was about 45% a year ago and is now around 65%.
This is not a relevant fact for the vast majority of users, whose right to read your company infringes upon.
So, I'm interested in hearing about technical ways to resolve these problems. Are there ways to reduce the amount of abuse through Tor? Could Tor Browser implement a blinded token scheme that would preserve anonymity and allow a Turing Test?
You clearly don't understand the problem as it was articulated, and if you expect people to solve your team's inability to implement a censorship system for you, I hope you find the help you need.
There are CDN/DDoS companies on the internet that provide spam protection for their customers. To do this they use CAPTCHAs to prove that the visitor is a human. Some companies provide protection to many websites, so a visitor from an abusive IP address will need to solve a CAPTCHA on each and every protected domain. Let's assume the CDN/DDoS company doesn't want to be able to correlate users visiting multiple domains. Is it possible to prove that a visitor is indeed human, once, but not allow the CDN/DDoS company to deanonymize / correlate the traffic across many domains?
In other words: is it possible to provide a bit of data (i'm-a-human) tied to the browsing session while not violating anonymity?
This sounds very much like something that could be provided through the use of zero-knowledge proofs. It doesn't seem clear to me that being able to say "this is an instance of Tor which has already answered a bunch of CAPTCHAs" is actually useful. I think the main problem with CAPTCHAs at this point is that robots are just about as good at answering them as humans. Apparently robots are worse than humans at building up tracked browser histories. That seems like a harder property for a Tor user to prove.
What sort of data would qualify as an 'i'm a human' bit?
Let's be clear on one point: humans do not request web pages. User-Agents request web pages. When people talk about "prove you're a human", what they really mean is "prove that your User-Agent behaves the way we expect it to".
CloudFlare expects that "good" User-Agents should leave a permanent trail of history between all sites across the web. Humans who decide they don't want this property, and use a User-Agent such as Tor Browser, fall outside of CloudFlare's conception of how User-Agents should behave (a conception which includes neither privacy nor anonymity), and are punished by CloudFlare accordingly.
It might be true that there is some kind of elaborate ZKP protocol that would allow a user to prove to CloudFlare that their User-Agent behaves the way CloudFlare demands, without revealing all of the user's browsing history to CloudFlare and Google. Among other things, this would require CloudFlare to explicitly and precisely describe both their threat model and their definition of 'good behaviour', which as far as I know they have never done.
However, it is not the Tor Project's job to perform free labour for a censor. If CloudFlare is actually interested in solving the problem, then perhaps the work should be paid for by the $100MM company that created the problem, not done for free by the nonprofit and community trying to help the people who suffer from it.
CloudFlare expects that "good" User-Agents should leave a permanent trail of history between all sites across the web.
No, we do not.
We have a simple need: our customers pay us to protect their web sites from DoS, spam and intrusions using things like SQL injection. We need to provide that service for the money they pay us.
Another way to think about this is to imagine we're not talking about Tor but some other source of abuse. In the past we've worked to shut down open DNS resolvers, open NTP servers, and we work with networks to disable abuse coming from them. We can't do those things with Tor because of its nature. So we're in a tough spot: we see abuse coming from Tor that's hard to deal with because of anonymity.
A related approach might be for us to say "Let's whitelist all the Tor exit nodes". Play that forward a bit and you could see that any abuser worth their salt would migrate to Tor, increasing the abuse problem through Tor.
Ultimately, I think we want the same thing: reduce abuse coming through Tor. Coming up with a good technical solution is hard, but worth working on. You may think that CloudFlare doesn't care about this problem, but in fact it's something that's occupying time (and therefore money) as we look for solutions.
Despite what's been said in this ticket there have been contacts between CloudFlare and Tor developers.
In other words: is it possible to provide a bit of data (i'm-a-human) tied to the browsing session while not violating anonymity.
Yes. This is a problem that "Anonymous Credential" systems are designed to solve. An example of a system with most of the properties that are desired is presented in Au, M. H., Kapadia, A., Susilo, W., "BLACR: TTP-Free Blacklistable Anonymous Credentials with Reputation" (https://www.cs.indiana.edu/~kapadia/papers/blacr-ndss-draft.pdf). Note that this is still an active research area, and BLACR in and of itself may not be practical/feasible to implement; it is listed only as an example since the paper gives a good overview of the problem and how this kind of primitive can be used to solve it.
Isis can go into more details on this sort of thing, since she was trying to implement a similar thing based on Mozilla Persona (aborted attempt due to Mozilla Persona being crap).
Having not read the BLACR paper yet… one should generally be wary of anonymous credentials which advertise some form of revocation, since effectively what this means is having some backdoor whereby a trusted third party can do "anonymity revocation". The other form this usually takes is to keep a blacklist (skimming tells me that BLACR does this), or keep some other form of state, e.g. "all blinded signature tokens we've already seen used before," which additionally introduces the requirement that the credential issuing server be always online.
There are other anonymous credential schemes built on NIZK proofs which do not require keeping expensive (and continually growing) blacklists, one of my personal favourites being described in Belenkiy, Lysyanskaya, Camenisch, Shacham, Chase, and Kohlweiss' "Randomizable Proofs and Delegatable Anonymous Credentials". The delegation aspect could also provide a nice feature of being able to e.g. say "I'll trust any user who has met the authentication requirements of any of Cloudflare, Wikipedia, or Amazon" without necessarily knowing which of those three the user had already authenticated to.
There are companies - such as CloudFlare - which are effectively now Global Active Adversaries.
That's an inflammatory introduction. We are not adversarial to Tor as an entity; we are trying to deal with abuse that uses the Tor network.
It is a statement of facts about capabilities. It is not inflammatory - Tor must take into account that Google, for example, can run arbitrary code from many thousands of websites visited in Tor Browser.
To say that CF is not adversarial is awkward - Tor users are prevented from browsing the web and are constantly blocked. I do not believe that CF has yet made this a specific act of malice, of course. But to design such a system without considering how it will impact Tor users, and without then working with us, is seriously problematic, as we see from user reports.
It's inevitable that a system providing anonymity gets abused (as well as used). I'm old enough to remember the trials and tribulations of the Penet remailer and spent a long time working in antispam.
Centralization ensures that your company is a high value target. The ability to run code in the browsers of millions of computers is highly attractive. The fact that both CF and Google appear in those CAPTCHA prompts probably ensures CF isn't even in control of the entirety of the risk. Is it the case that for all the promises CF makes, Google is actually in control of the CAPTCHA - and thus is by proxy given the ability to run code in the browsers of users visiting CF-terminated sites?
Ultimately, I think we want the same thing: reduce abuse coming through Tor. Coming up with a good technical solution is hard, but worth working on. You may think that CloudFlare doesn't care about this problem, but in fact it's something that's occupying time (and therefore money) as we look for solutions.
Offering a read-only version of these websites would be a very good mitigation that could be done effectively instantly - by enabling the above-mentioned "Always Online" CDN option - where a CAPTCHA would be added. For any POST action, a JavaScript hook could be added to then prompt to solve a CAPTCHA, as discussed above.
A related approach might be for us to say "Let's whitelist all the Tor exit nodes". Play that forward a bit and you could see that any abuser worth their salt would migrate to Tor, increasing the abuse problem through Tor.
That would be a fine approach - it is true that this could be a problem but this would absolutely solve the "defaults" problem we see today.
Despite what's been said in this ticket there have been contacts between CloudFlare and Tor developers.
I am one of those developers, and after more than a year, I'm sorry to say that we need to have substantially more serious discussions. Individual engineers who care are not enough. There are also other options - such as some of the things suggested above. I really like the idea of an interstitial that allows a user to see a third-party read-only CDN cache before remote code execution happens in the user's browser.
In any case - I think we all agree that there is a serious problem here and we should involve our communities and not just have backroom communications that do not result in differences for users. There are millions of impacted users who are being censored from reading websites because of a combination of issues - every single day.
I encourage you to use the Tor Browser for a week and report back to us about how well it works for you. If your experience is completely different from the rest of us, we'd very much like to learn about the different factors in your web surfing habits.
Earlier @ioerror asked if there was open data on abuse from Tor exit nodes. In 2014 I wrote a small program called "torhoney" that pulls the list of exit nodes and matches it against data from Project Honeypot about abuse. That code is here: https://github.com/jgrahamc/torhoney. You can run it and see the mapping between an exit node and its Project Honeypot score to get a sense for abuse from the exit nodes.
I ran the program today and have data on 1,057 exit nodes showing that Project Honeypot marks 710 of them as a source of comment spam (67%), with 567 having a score of greater than 25 (in Project Honeypot terminology, meaning the node delivered at least 100 spam messages) (54%). Over time these values have been trending upwards. I've been recording the Project Honeypot data for about 13 months; the percentage of exit nodes listed as a source of comment spam was about 45% a year ago and is now around 65%.
This is useful though it is unclear - is this what CF uses on the backend? Is this data the reason that Google's captchas are so hard to solve?
Furthermore - what is the expected value for a network with millions of users per day?
So, I'm interested in hearing about technical ways to resolve these problems. Are there ways to reduce the amount of abuse through Tor? Could Tor Browser implement a blinded token scheme that would preserve anonymity and allow a Turing Test?
Offering a read-only version of these websites that prompts for a CAPTCHA on POST would be a very basic and simple way to reduce the flood of upset users. Ensuring that a CAPTCHA can be solved without getting stuck in a 14 or 15 solution loop is another issue - that may be a bug unsolvable by CF that rather needs to be addressed by Google. Another option, as I mentioned above, might be to stop a user before ever reaching a website that is going to ask them to run JavaScript and connect them between two very large endpoints (CF and Google).
Does Google any end user connections for those captcha requests? If so - it seems like the total set of users for CF would be seen by both Google and CF, meaning that data on all Cloudflare users prompted for the captcha would be available to Google. Is that incorrect?
In any case - I think we all agree that there is a serious problem here and we should involve our communities and not just have backroom communications that do not result in differences for users. There are millions of impacted users who are being censored from reading websites because of a combination of issues - every single day.
I don't agree with your characterization of this as "censoring". That implies an active desire to prevent people from reaching certain types of content. Given all that we've done to uphold free speech in the face of a barrage of criticism I think your use of the word "censor" is unwarranted.
I encourage you to use the Tor Browser for a week and report back to us about how well it works for you. If your experience is completely different from the rest of us, we'd very much like to learn about the different factors in your web surfing habits.
I did this three weeks ago. In addition the entire company was forced for 30 days to see CAPTCHAs any time they visited a site using CloudFlare while in our offices. Doing so caused us to fix lots of problems with the way the CAPTCHA was implemented. I also personally worked on the code that deals with prevention of a CAPTCHA when the circuit changes and fixed a bug that was preventing it working correctly.
This is useful though it is unclear - is this what CF uses on the backend? Is this data the reason that Google's captchas are so hard to solve?
It's a data source that we use for IP reputation. I was using it as an illustration as well because it's a third party. I don't know if there's any connection between Project Honeypot and Google's CAPTCHAs.
Offering a read-only version of these websites that prompts for a CAPTCHA on POST would be a very basic and simple way to reduce the flood of upset users. Ensuring that a CAPTCHA can be solved without getting stuck in a 14 or 15 solution loop is another issue - that may be a bug unsolvable by CF that rather needs to be addressed by Google. Another option, as I mentioned above, might be to stop a user before ever reaching a website that is going to ask them to run JavaScript and connect them between two very large endpoints (CF and Google).
I'm not convinced about the R/O solution. Seems to me that Tor users would likely be more upset the moment they got stale information or couldn't POST to a forum or similar. I'd much rather solve the abuse problem and make this go away completely. Also, the CAPTCHA-loop thing is an issue that needs to be addressed by us and Google.
I still think the blinded tokens thing is going to be interesting to investigate, because it would help anonymously prove that the User-Agent was controlled by a human, and the token could be sent in a way that eliminates the need for any JavaScript.
Does Google any end user connections for those captcha requests?
A related approach might be for us to say "Let's whitelist all the Tor exit nodes". Play that forward a bit and you could see that any abuser worth their salt would migrate to Tor, increasing the abuse problem through Tor.
That would be a fine approach - it is true that this could be a problem but this would absolutely solve the "defaults" problem we see today.
It's a very short-term solution, because if all the abuse moves to Tor, the obvious next step is that our clients come along and demand that we give them the option to block visitors from Tor completely. If we go that way wholesale I think it will be negative for everyone.
In any case - I think we all agree that there is a serious problem here and we should involve our communities and not just have backroom communications that do not result in differences for users. There are millions of impacted users who are being censored from reading websites because of a combination of issues - every single day.
I don't agree with your characterization of this as "censoring". That implies an active desire to prevent people from reaching certain types of content. Given all that we've done to uphold free speech in the face of a barrage of criticism I think your use of the word "censor" is unwarranted.
I don't agree with the characterization of this as mere "blocking" when CF prevents users from reading websites. I haven't even begun to describe the pain of having written lengthy comments only to hit a CAPTCHA loop that censored my speech as well.
It is censorship from where many of our users stand. Some of our Chinese users refer to it as the Great Distributed Firewall that they hit after jumping over the other Great Firewall.
Forgive me for not knowing the other details about Cloudflare and free speech - I'm not at all trying to characterize those activities. The active blocking and CAPTCHA-loop issues are seriously problematic, and the result is that websites are unreadable. I'm not claiming you're burning books or something silly. I'm correctly pointing out that the books are safely on the other side of a locked door and we're being turned into CAPTCHA-solving machines that often do not unlock the door, if you'll forgive the metaphor.
I encourage you to use the Tor Browser for a week and report back to us about how well it works for you. If your experience is completely different from the rest of us, we'd very much like to learn about the different factors in your web surfing habits.
I did this three weeks ago. In addition the entire company was forced for 30 days to see CAPTCHAs any time they visited a site using CloudFlare while in our offices. Doing so caused us to fix lots of problems with the way the CAPTCHA was implemented. I also personally worked on the code that deals with prevention of a CAPTCHA when the circuit changes and fixed a bug that was preventing it working correctly.
You used it for a week after all of these changes were deployed? And you didn't encounter any issues? You feel that it works perfectly and that there are no valid issues being voiced? Or...?
You used it for a week after all of these changes were deployed? And you didn't encounter any issues? You feel that it works perfectly and that there are no valid issues being voiced? Or...?
I did not encounter the loops that people are talking about. If I had I would have had one of the engineers fix that problem. The biggest thing I encountered was that our "one CAPTCHA per site modulo circuit change" code wasn't working and I fixed it. I'd like to get this to a point where Tor users are not in pain and during our CAPTCHA testing we found some problems which were fixed.
It would be very helpful if someone were able to reproduce the CAPTCHA-loop thing so we can address it. I will get an engineer to take a look and see if we can reproduce it internally.
This is useful though it is unclear - is this what CF uses on the backend? Is this data the reason that Google's captchas are so hard to solve?
It's a data source that we use for IP reputation. I was using it as an illustration as well because it's a third party. I don't know if there's any connection between Project Honeypot and Google's CAPTCHAs.
How do we vet this information or these so-called "threat scores" other than trusting what someone says?
Offering a read-only version of these websites that prompts for a CAPTCHA on POST would be a very basic and simple way to reduce the flood of upset users. Ensuring that a CAPTCHA can be solved without getting stuck in a 14 or 15 solution loop is another issue - that may be a bug unsolvable by CF that rather needs to be addressed by Google. Another option, as I mentioned above, might be to stop a user before ever reaching a website that is going to ask them to run JavaScript and connect them between two very large endpoints (CF and Google).
I'm not convinced about the R/O solution. Seems to me that Tor users would likely be more upset the moment they got stale information or couldn't POST to a forum or similar. I'd much rather solve the abuse problem and make this go away completely.
Are you convinced that it is strictly worse than the current situation? I'm convinced that it is strictly better to only toss up a CAPTCHA that loads a Google resource when a user is about to interact with the website in a major way.
I do not believe that you can solve abuse on the internet any more than a country can "solve" healthcare or the hacker community can "solve" surveillance. Abuse is relative and it is part of having free speech on the internet. There is no doubt a problem - but the solution is not to collectively punish millions of people (and their bots, who are people too, man :-) ) based on ~1600 IP address "threat" scores.
Also, the CAPTCHA-loop thing is an issue that needs to be addressed by us and Google.
Does that mean that Google, in addition to CF, has data on everyone hitting those captchas?
I still think the blinded tokens thing is going to be interesting to investigate, because it would help anonymously prove that the User-Agent was controlled by a human, and the token could be sent in a way that eliminates the need for any JavaScript.
I'm not at all convinced that this can be done in the short term, and it seems to assume that users only use graphical browsers. Attackers will be able to extract tokens and have farms of people solving challenges when they need new tokens, so as usual regular users pay the highest price.
Does Google any end user connections for those captcha requests?
Can you rewrite that? Couldn't parse it.
When a user is given a CF CAPTCHA - does Google see any request from them directly? Do they see the Tor exit IP hitting them? Is it just CF or is it also Google? Do both companies get to run JavaScript in this user's browser?
You used it for a week after all of these changes were deployed? And you didn't encounter any issues? You feel that it works perfectly and that there are no valid issues being voiced? Or...?
I did not encounter the loops that people are talking about. If I had I would have had one of the engineers fix that problem. The biggest thing I encountered was that our "one CAPTCHA per site modulo circuit change" code wasn't working and I fixed it. I'd like to get this to a point where Tor users are not in pain and during our CAPTCHA testing we found some problems which were fixed.
We'd all like that - I'd really like it if it were entirely CAPTCHA-free until there is a POST request, for example. A read-only version of the website, rather than a CAPTCHA prompt just to read, would be better, wouldn't it?
It would be very helpful if someone were able to reproduce the CAPTCHA-loop thing so we can address it. I will get an engineer to take a look and see if we can reproduce it internally.
How many people are actively testing with Tor Browser on a daily basis for regressions? Does anyone use it full time?
Why not just blanket-disallow POST for Tor exit nodes? That takes care of the bulk of everyone's problems.
That doesn't solve the issue in a proportional manner. It would be better to solve a captcha or use an anonymous token for certain kinds of interactive activity over blanket denial.
It also doesn't solve any of the other issues - such as the code running in people's browsers, the PII collected and so on. I'd rather a user have an option to hit an archive that is unrelated at that point - wouldn't you?
We're adding an "auto-pay" option to the auditor signing keys in GNU Taler to allow the creation of denomination signing keys for automatic payments without user confirmation. https://taler.net/
Automatic payments are a potential deanonymization vector if the attacker can issue as many denomination keys as they like. We'd therefore envision the Tor project being the auditor who limits the issuing of new denomination signing keys.
Ideally, CloudFlare would run a mint whose denomination keys the Tor project signs every few months. Anytime a TBB user solves a CloudFlare CAPTCHA, they'd receive a stash of tokens that TBB automatically uses to access pages.
We've actually had some limited discussions with CloudFlare about doing this. I'll speak about it some at the Tor dev meeting later this week. Along with several interesting variations.
I think the idea of using Taler is an interesting open research question. It also seems orthogonal to many possible options that do not involve complicated cryptographic solutions with questionable anonymity properties. Using tokens, cookies, anonymous credentials or ledger-based solutions may be useful once a user tries to do some SQLi - I'm not at all convinced that it is reasonable to require what sounds like "an internet driver's license" or some Chaum scheme to read a web page.
CAPTCHAs are a fundamentally untenable solution for dealing with DDoS attacks. Algorithmic solutions will always catch up to evolving CAPTCHA methods. CloudFlare and other service providers should recognize that this is the inevitable direction technology is going and abandon it now.
An alternate solution is a client proof-of-work protocol. This puts a greater burden on attackers attempting to establish many connections than on users who only need one connection. Then once a TLS session is established, the server can determine from the behavior of that client whether it's an attacker and drop the connection. We should try to standardize that and get it into TLS implementations so service providers have an easy configuration choice.
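A hashcash-style sketch of that idea, assuming the server hands the client a fresh random challenge and a difficulty target before serving content; the difficulty value is illustrative and would be tuned (and raised under attack) in practice:

```python
# Client proof-of-work sketch: the server issues a fresh random
# challenge, the client burns CPU to find a nonce, and the server
# verifies with a single hash.
import hashlib
import os
from itertools import count

DIFFICULTY_BITS = 18    # client does ~2^18 hashes on average; server does 1

def leading_zero_bits(digest):
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def solve(challenge, difficulty=DIFFICULTY_BITS):
    # Client side: search for a nonce whose hash clears the difficulty bar.
    for nonce in count():
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce

def verify(challenge, nonce, difficulty=DIFFICULTY_BITS):
    # Server side: one hash to check the client's work.
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty

challenge = os.urandom(16)       # fresh per connection, so work can't be reused
nonce = solve(challenge)         # costly for the client...
assert verify(challenge, nonce)  # ...cheap for the server
```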
A related approach might be for us to say "Let's whitelist all the Tor exit nodes". Play that forward a bit and you could see that any abuser worth their salt would migrate to Tor, increasing the abuse problem through Tor.
That would be a fine approach - it is true that this could be a problem but this would absolutely solve the "defaults" problem we see today.
It's a very short-term solution, because if all the abuse moves to Tor, the obvious next step is that our clients come along and demand that we give them the option to block visitors from Tor completely. If we go that way wholesale I think it will be negative for everyone.
Treating Tor as special seems to make sense, as it is already treated specially: ~1600 exit nodes shared by millions of users seems to utterly ruin IP reputation schemes.
I also find it hard to believe that "all the abuse" will move to Tor. Even if a great deal of it moved to Tor, we have lots of users and lots of traffic that is not abusive traffic.