Several users have complained that the CAPTCHAs displayed on bridges.torproject.org are very hard, or even impossible, for them to solve. For me it usually takes 10 to 20 tries, when I do not give up. The 'H' in CAPTCHA stands for “humans”. If humans cannot even solve the CAPTCHAs offered by reCAPTCHA, then we have a problem.
Can we consider migrating away from reCAPTCHA? Also, even if we proxy the queries to Google, not relying on Google at all would make me feel better.
Otherwise, the problem needs to be taken to the reCAPTCHA admins.
I've been considering this as well. infinity0 pointed us to Asirra [0] (yes, I know, Microsoft) a few days ago, which looks promising, but we can also consider using something that runs locally, or maybe even try both at the same time for a month (alternating between them for each request) and see which one appears to work better. Other suggestions are also welcome.
I always have to magnify the captcha by pressing Ctrl-+ several times in order to be able to solve it. Without that it is really hard. Showing the captcha larger would surely help. Is it possible to enlarge it by default?
There is also a Python API which basically scripts GIMP to create a local cache of CAPTCHAs, made by the SpiderOak developers: https://spideroak.com/code
It's not packaged properly (there's just a tarball and a checksum hash; it's not in PyPI or distro repos), so updates will be annoying, but it was the only decent thing I found when I looked into this a few months ago.
I spent about half an hour last night reviewing and testing the SpiderOak Python/GIMP captcha generation script. It works. I tweaked it a bit (to make it harder, actually, since by default it's only 5 letters/numbers). (No commits yet, because I only fiddled with it.)
Before I go any farther, there are the following open questions about doing local captcha generation:
Can this run on a headless server?
This is highly resource intensive: on my laptop it took ~8 minutes to generate 2,000 captchas. Can BridgeDB handle this? Should we run it elsewhere and sync them to BridgeDB?
> Before I go any farther, there are the following open questions about doing local captcha generation:
> Can this run on a headless server?
Likely yes (in the extreme case, just run an X server inside VNC or similar). The biggest problem is likely the amount of randomness available to the system, so a hardware RNG should be present on the box.
> This is highly resource intensive: on my laptop it took ~8 minutes to generate 2,000 captchas. Can BridgeDB handle this? Should we run it elsewhere and sync them to BridgeDB?
I would suggest having one or more machines supply captchas in bulk to BridgeDB. Do we have an idea of how many queries are being made at the moment? It should then be easy to provide a daily batch of captchas to the host.
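As a rough sizing aid for that daily batch, here is a back-of-the-envelope sketch based only on the laptop figure quoted above (2,000 captchas in about 8 minutes). The request count and the `headroom` factor are hypothetical placeholders, since we do not yet know the real query volume, and a server may be faster or slower than that laptop.

```python
import math

# Measured on a laptop, per the comment above: ~8 minutes for 2,000 captchas.
CAPTCHAS_PER_BATCH = 2000
MINUTES_PER_BATCH = 8


def generation_minutes(requests_per_day, headroom=2.0):
    """Estimate minutes of generation time needed for one day's captchas.

    requests_per_day and headroom are assumptions for illustration only;
    headroom over-provisions so that refreshes do not exhaust the cache.
    """
    rate_per_minute = CAPTCHAS_PER_BATCH / MINUTES_PER_BATCH  # 250 per minute
    needed = math.ceil(requests_per_day * headroom)
    return needed / rate_per_minute
```

For example, `generation_minutes(10000)` comes out at 80 minutes of generation per day on comparable hardware, which a separate machine could do off-peak and then sync to BridgeDB.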
> Is this something we want?
I would say yes, IMHO, as the current captchas are unreadable. They also depend on an external and rather untrustworthy entity (IMHO): though they are unable to see the source of the queries, they can at least see that queries are being made, how many there are, etc., and possibly correlate them with other events that they have their eyes on (e.g. if people are silly and use their public DNS system...).
reCAPTCHA (Google) doesn't just transcribe books anymore; it also reads house numbers for maps, names, and whatever else they need into their monolith. If helping to develop that bothers you, you need to ditch external captchas in favor of internal engines. One such engine...
http://www.phpcaptcha.org/ (SecurImage)
http://www.graphicsmagick.org/
If you get word from Google on why they seem to serve hard captchas to most Tor IPs and easy ones to non-Tor IPs, post it here.
Okay, my branch for this work is done, and it seems to work well, but I have not yet written unittests for it. I would prefer it also had more documentation on how to generate the CAPTCHAs.
However, the work itself is ready for review while I finish the unittests. I am changing the priority to 'major' because of all the complaints going to the support desk.
It looks sane! (I actually reviewed your fix/11127-recaptcha-ssl_10809r1_r1, but putting GimpCaptcha review here)
I haven't reviewed GimpCaptchaTests yet, nor run the code, but based on the review I think there are only two things that we might want to change.
(as i mentioned earlier) it would be nice if we could use both captcha systems at the same time, so creating a CaptchaProtectedResource class that wraps ReCaptcha and Gimp, selecting one when we receive a request with a preset probability, seems like the easiest way to do it. The hard part, it seems, will be determining which system was chosen when we receive the challenge and solution from the client (but this shouldn't be too difficult).
the Gimp code looks good, but I think it would be better if the challenges were pinned to a time period, e.g. in GimpCaptcha.createChallenge() prepend the next 5 minute time period to the encrypted text when you create the hmac for the challenge. Then, in GimpCaptcha.check(), verify the captcha was sent to the client within the previous 5 minute period or the current 5 minute period, and continue processing if one of these is true but not both. (I have no affinity to 5 minute time periods :))
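To make the time-pinning suggestion concrete, here is a minimal sketch using only the Python standard library. It is not the actual GimpCaptcha code: the function names, the challenge layout (HMAC tag followed by the encrypted blob), and the choice of SHA-256 are all assumptions for illustration. It also stamps the challenge with the index of the period in which it was created, rather than the "next" period, and accepts it during that period or the following one.

```python
import hashlib
import hmac
import time

PERIOD = 5 * 60   # seconds; arbitrary, as noted above
TAG_LEN = 32      # length of a SHA-256 digest in bytes


def _period(now, offset=0):
    """Index of the PERIOD-second window containing `now`, shifted by offset."""
    return int(now // PERIOD) + offset


def _tag(key, period, blob):
    """HMAC over the period index prepended to the encrypted answer blob."""
    msg = str(period).encode("ascii") + b";" + blob
    return hmac.new(key, msg, hashlib.sha256).digest()


def create_challenge(key, blob, now=None):
    """Return tag || blob, with the tag pinned to the current period."""
    now = time.time() if now is None else now
    return _tag(key, _period(now), blob) + blob


def check_challenge(key, challenge, now=None):
    """True if the tag matches the current or the previous period."""
    now = time.time() if now is None else now
    tag, blob = challenge[:TAG_LEN], challenge[TAG_LEN:]
    for offset in (0, -1):
        expected = _tag(key, _period(now, offset), blob)
        if hmac.compare_digest(tag, expected):
            return True
    return False
```

With this layout a challenge stays valid for between 5 and 10 minutes depending on when in the window it was issued, and the server needs no per-challenge state to expire old ones.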
> It looks sane! (I actually reviewed your fix/11127-recaptcha-ssl_10809r1_r1, but putting GimpCaptcha review here)
> I haven't reviewed GimpCaptchaTests yet, nor run the code, but based on the review I think there are only two things that we might want to change.
> (as i mentioned earlier) it would be nice if we could use both captcha systems at the same time, so creating a CaptchaProtectedResource class that wraps ReCaptcha and Gimp, selecting one when we receive a request with a preset probability, seems like the easiest way to do it. The hard part, it seems, will be determining which system was chosen when we receive the challenge and solution from the client (but this shouldn't be too difficult).
I am thinking of making this a separate enhancement ticket, since I think the fine people helping the support desk will have a better quality of life if we first make human-passable Turing tests.
One thing that has just occurred to me is that, if either reCaptcha or the gimp-captchas are considered easier, and we have a probabilistic wrapper resource for choosing one or the other, couldn't a user just refresh until they get the easier one? I mean, the webserver isn't stateful between one request and the next. Making it stateful would mean rewriting most of it.
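For reference, here is a rough sketch of what such a probabilistic wrapper might look like while staying stateless. Everything in it is hypothetical: the `ChoosingCaptcha` and `StaticCaptcha` names, the `create()`/`check()` backend interface, and the idea of returning the backend name alongside an HMAC tag so the server can tell which system issued a challenge when the solution comes back. It is not the existing CaptchaProtectedResource code.

```python
import hashlib
import hmac
import random


class StaticCaptcha:
    """Stand-in backend for the demo: one fixed challenge with a known answer."""

    def __init__(self, challenge, answer):
        self.challenge = challenge
        self.answer = answer

    def create(self):
        return self.challenge

    def check(self, challenge, solution):
        return challenge == self.challenge and solution == self.answer


class ChoosingCaptcha:
    """Pick a backend per request with preset weights, without server state."""

    def __init__(self, backends, key, rng=None):
        # backends maps a name to a (backend, probability) pair.
        self.backends = backends
        self.key = key
        self.rng = rng or random.Random()

    def _tag(self, name, challenge):
        # Bind the backend name to the challenge so a client cannot swap it.
        msg = name.encode("ascii") + b"\x00" + challenge
        return hmac.new(self.key, msg, hashlib.sha256).hexdigest()

    def create(self):
        names = list(self.backends)
        weights = [self.backends[n][1] for n in names]
        name = self.rng.choices(names, weights=weights, k=1)[0]
        challenge = self.backends[name][0].create()
        return name, challenge, self._tag(name, challenge)

    def check(self, name, challenge, tag, solution):
        if name not in self.backends:
            return False
        if not hmac.compare_digest(tag, self._tag(name, challenge)):
            return False
        return self.backends[name][0].check(challenge, solution)
```

Note that each call to `create()` is an independent draw, so this sketch has exactly the weakness described above: a client who refreshes simply gets another draw and can keep going until the easier system comes up.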
> the Gimp code looks good, but I think it would be better if the challenges were pinned to a time period, e.g. in GimpCaptcha.createChallenge() prepend the next 5 minute time period to the encrypted text when you create the hmac for the challenge. Then, in GimpCaptcha.check(), verify the captcha was sent to the client within the previous 5 minute period or the current 5 minute period, and continue processing if one of these is true but not both. (I have no affinity to 5 minute time periods :))