reCAPTCHAs on bridges.torproject.org are impossible for humans to solve

added labels (Legacy / Trac): bridgdb-0.1.5, component::circumvention/bridgedb, owner::isis, priority::high, resolution::fixed, status::closed, type::defect

I've been considering this as well. infinity0 pointed us to Asirra [0] (yes, I know, Microsoft) a few days ago, which looks promising, but we can also consider using something that runs locally, or maybe even try both at the same time for a month (alternating between them for each request) and see which one appears to work better. Other suggestions are also welcome.

[0] https://research.microsoft.com/en-us/um/redmond/projects/asirra/

I always have to magnify the captcha by pressing Ctrl-+ several times in order to be able to solve it. Without that it is really hard. Showing the captcha larger would surely help. Is it possible to enlarge it by default?

Trac:
Username: torland

There is also a Python API, made by the SpiderOak developers, which basically scripts GIMP to create a local cache of CAPTCHAs: https://spideroak.com/code

It's a bit annoying because it's not packaged properly (there's just a tarball and a checksum hash; it's not in PyPI or distro repos), so updates will be annoying, but it was the only decent thing I found when I looked into this a few months ago.

Trac:
Owner: N/A to isis
Status: new to accepted

I sent an email to recaptcha support. Let's see what they say.

The reCAPTCHA team should be whitelisting bridges.torproject.org soon. This will hopefully greatly improve the situation.

I spent about half an hour last night reviewing and testing the spideroak python/gimp captcha generation script. It works. I tweaked it a bit (to make it harder, actually, since by default it's only 5 letters/numbers). (no commits yet because i only fiddled with it)

Before I go any farther, there are the following open questions about doing local captcha generation:

Can this run on a headless server?
This is highly resource intensive: on my laptop it took ~8 minutes to generate 2,000 captchas. Can BridgeDB handle this? Should we run it elsewhere and sync them to BridgeDB?
Is this something we want?

For making this easier for mobile users, there is this SE answer which includes a mobile captcha HTML5 trace system, however it uses HTML5 canvases.

Just for the record, the CAPTCHAs still look very difficult to me currently.

[..]

Before I go any farther, there are the following open questions about doing local captcha generation:

Can this run on a headless server?

Likely yes (in the extreme case just run an X server inside VNC or so); likely the biggest problem is the amount of randomness that is available to the system, hence a hardware RNG should be present on the box.

This is highly resource intensive: on my laptop it took ~8 minutes to generate 2,000 captchas. Can BridgeDB handle this? Should we run it elsewhere and sync them to BridgeDB?

I would suggest having one or more machines supply captchas in bulk to BridgeDB. Do we have an idea of how many queries are being made at the moment? It should then be easy to provide a daily batch of captchas to the host.
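As a rough sizing aid for the bulk-supply idea, the sketch below turns the figures quoted in this thread (~8 minutes for 2,000 captchas on a laptop) into a per-captcha cost and a daily ceiling for one such machine. The daily request count is a hypothetical placeholder, since the real query rate is not stated here.

```python
# Throughput implied by the figures quoted in this thread.
GENERATION_SECONDS = 8 * 60   # ~8 minutes, as reported above
CAPTCHAS_GENERATED = 2000     # batch size reported above

seconds_per_captcha = GENERATION_SECONDS / CAPTCHAS_GENERATED
captchas_per_day = 24 * 60 * 60 * CAPTCHAS_GENERATED / GENERATION_SECONDS


def minutes_to_generate(daily_requests):
    """Minutes one such machine needs to produce `daily_requests` captchas."""
    return daily_requests * GENERATION_SECONDS / CAPTCHAS_GENERATED / 60


# HYPOTHETICAL load: the actual request rate is not given in this thread.
ASSUMED_DAILY_REQUESTS = 10000

print(seconds_per_captcha)                          # 0.24
print(captchas_per_day)                             # 360000.0
print(minutes_to_generate(ASSUMED_DAILY_REQUESTS))  # 40.0
```

Under these assumptions a single laptop-class machine could refill a 10,000-captcha pool in about 40 minutes, so generating elsewhere and syncing once a day looks feasible; the real numbers depend on the actual query rate and hardware.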

Is this something we want?

I would say, yes IMHO, as the current captchas are unreadable. Also they depend on an external rather untrustworthy entity (IMHO); though they are unable to see the source of the queries, they can at least see that queries are being made, how many there are etc and possibly correlate them with other events that they have their eyes on (eg if people are silly and use their public DNS system...).

Trac:
Username: massar

recaptcha(google) doesn't just solve books anymore, they read house numbers for maps, names, and whatever else they need into their monolith. If helping to develop that bothers you, you need to ditch external captchas in favor of internal engines. One such engine... http://www.phpcaptcha.org/ (SecurImage) http://www.graphicsmagick.org/

If you get word from Google on why they seem to serve hard captchas to most Tor IPs and easy ones to non-Tor IPs, post it here.

The CAPTCHA looks just the same as before. Maybe their support team could look into what is happening?

Users keep complaining.

At least 3 new complaints on the help desk today.

Replying to lunar:

The CAPTCHA looks just the same as before. Maybe their support team could look into what is happening?

See legacy/trac#10834 (moved), which is merged for 0.1.5 (current deployment is 0.1.4).

Also see https://github.com/isislovecruft/gimp-captcha

Trac:
Keywords: N/A deleted, bridgdb-0.1.5 added

Okay, my branch for this work is done, and it seems to work well, but I have not yet written unittests for it. I would prefer it also had more documentation on how to generate the CAPTCHAs.

However, the work itself is ready for review while I finish the unittests. I am changing the priority to 'major' because of all the complaints going to the support desk.

This adds support for using CAPTCHAs from a local directory (created with my Gimp+Python CAPTCHA generation scripts). It also works with my branch for legacy/trac#11127 (moved).
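For readers unfamiliar with the idea, serving CAPTCHAs from a local directory can be sketched roughly as below. This is not the code from the branch: the function name `pick_captcha` is hypothetical, and the layout (one `<answer>.jpg` file per CAPTCHA, with the answer taken from the filename) is an assumption made for illustration.

```python
import os
import random


def pick_captcha(cache_dir):
    """Return (image_bytes, answer) for a random CAPTCHA in `cache_dir`.

    ASSUMPTION: every image is stored as "<answer>.jpg", so the expected
    solution can be recovered from the filename alone.
    """
    names = [n for n in os.listdir(cache_dir) if n.endswith(".jpg")]
    if not names:
        raise RuntimeError("no CAPTCHAs found in %r" % cache_dir)
    # SystemRandom draws from the OS entropy source rather than a seeded PRNG.
    name = random.SystemRandom().choice(names)
    answer = os.path.splitext(name)[0]
    with open(os.path.join(cache_dir, name), "rb") as fh:
        return fh.read(), answer
```

The caller would send the image bytes to the client and keep the answer server-side (or inside an encrypted, authenticated challenge field) for the later check.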

Trac:
Priority: normal to major
Cc: N/A to sysrqb, isis

Replying to isis:

This adds support for using CAPTCHAs from a local directory (created with my Gimp+Python CAPTCHA generation scripts). It also works with my branch for legacy/trac#11127 (moved).

It looks sane! (I actually reviewed your fix/11127-recaptcha-ssl_10809r1_r1, but putting GimpCaptcha review here)

I haven't reviewed GimpCaptchaTests yet, nor run the code, but based on the review I think there are only two things that we might want to change.

(as i mentioned earlier) it would be nice if we could use both captcha systems at the same time, so creating a CaptchaProtectedResource class that wraps ReCaptcha and Gimp, selecting one when we receive a request with a preset probability, seems like the easiest way to do it. The hard part, it seems, will be determining which system was chosen when we receive the challenge and solution from the client (but this shouldn't be too difficult).
the Gimp code looks good, but I think it would be better if the challenges were pinned to a time period, e.g. in GimpCaptcha.createChallenge() prepend the next 5 minute time period to the encrypted text when you create the hmac for the challenge. Then, in GimpCaptcha.check(), verify the captcha was sent to the client within the previous 5 minute period or the current 5 minute period, and continue processing if one of these is true but not both. (I have no affinity to 5 minute time periods :))
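The time-pinning suggestion can be sketched as follows. This is an illustration only, not the actual `GimpCaptcha.createChallenge()` or `GimpCaptcha.check()` code: the function names, the `"<hmac>;<hex blob>"` wire format, and the use of SHA-256 are assumptions. It also simplifies the proposal slightly by binding the HMAC to the period in which the challenge was issued and accepting the current or the previous period at check time.

```python
import hashlib
import hmac
import time

PERIOD_SECONDS = 5 * 60  # arbitrary, as noted above


def _period(now):
    """Index of the time period containing `now`."""
    return int(now) // PERIOD_SECONDS


def _tag(key, period, blob):
    """HMAC over the period index prepended to the encrypted answer."""
    msg = str(period).encode("ascii") + b";" + blob
    return hmac.new(key, msg, hashlib.sha256).hexdigest()


def create_challenge(key, encrypted_answer, now=None):
    """Return "<hmac>;<hex blob>", bound to the period of issue."""
    now = time.time() if now is None else now
    tag = _tag(key, _period(now), encrypted_answer)
    return tag + ";" + encrypted_answer.hex()


def check_challenge(key, challenge, now=None):
    """True if the challenge was issued in the current or previous period."""
    now = time.time() if now is None else now
    try:
        tag, blob_hex = challenge.split(";", 1)
        blob = bytes.fromhex(blob_hex)
    except ValueError:
        return False  # malformed challenge
    current = _period(now)
    return any(
        hmac.compare_digest(tag.encode("utf-8"),
                            _tag(key, p, blob).encode("ascii"))
        for p in (current, current - 1)
    )
```

Because the period is recomputed on the server rather than read from the client, a challenge stops verifying once two period boundaries have passed, without any per-client state.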

Trac:

example gimp-captcha

Replying to sysrqb:

(as i mentioned earlier) it would be nice if we could use both captcha systems at the same time, so creating a CaptchaProtectedResource class that wraps ReCaptcha and Gimp, selecting one when we receive a request with a preset probability, seems like the easiest way to do it. The hard part, it seems, will be determining which system was chosen when we receive the challenge and solution from the client (but this shouldn't be too difficult).

I am thinking of making this a separate enhancement ticket, since I think the fine people helping the support desk will have a better quality of life if we first make human-passable Turing tests.

One thing that has just occurred to me is that, if either reCaptcha or the gimp-captchas are considered easier, and we have a probabilistic wrapper resource for choosing one or the other, couldn't a user just refresh until they get the easier one? I mean, the webserver isn't stateful between one request and the next. Making it stateful would mean rewriting most of it.
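To make the trade-off concrete, here is a minimal sketch of the two ways a wrapper could pick a backend. None of this is from the branch: the function names, the `"backend:challenge"` tagging format, and in particular `choose_sticky` (which derives the choice from a keyed hash of the client address and a time period, so that refreshing does not re-roll it and no server-side state is needed) are assumptions offered as one possible approach, not something decided in this ticket.

```python
import hashlib
import hmac
import random

BACKENDS = ("recaptcha", "gimp")
GIMP_PROBABILITY = 0.5  # hypothetical preset probability


def choose_random(rng=random):
    """Independent coin flip per request; a page refresh re-rolls it."""
    return "gimp" if rng.random() < GIMP_PROBABILITY else "recaptcha"


def choose_sticky(key, client_addr, period):
    """Same (client_addr, period) always maps to the same backend."""
    msg = ("%s|%d" % (client_addr, period)).encode("utf-8")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    # Map the first 8 bytes of the digest to a fraction in [0, 1].
    fraction = int.from_bytes(digest[:8], "big") / 2.0 ** 64
    return "gimp" if fraction < GIMP_PROBABILITY else "recaptcha"


def tag_challenge(backend, challenge):
    """Prefix the challenge so the check step knows which backend issued it."""
    return backend + ":" + challenge


def split_challenge(tagged):
    """Inverse of tag_challenge(); rejects unknown backends."""
    backend, sep, challenge = tagged.partition(":")
    if not sep or backend not in BACKENDS:
        raise ValueError("unknown CAPTCHA backend")
    return backend, challenge
```

The tag also covers the earlier question of how to tell which system issued a challenge: a client that rewrites the prefix only gets its solution checked by a backend that never issued that challenge. The sticky variant has its own limits, for example many clients sharing one address all get the same backend for that period.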

the Gimp code looks good, but I think it would be better if the challenges were pinned to a time period, e.g. in GimpCaptcha.createChallenge() prepend the next 5 minute time period to the encrypted text when you create the hmac for the challenge. Then, in GimpCaptcha.check(), verify the captcha was sent to the client within the previous 5 minute period or the current 5 minute period, and continue processing if one of these is true but not both. (I have no affinity to 5 minute time periods :))

Yeah, I totally agree. There is a TODO comment about it in the commit message for eeb6956ed7f7ddd0f2592c17f4a5d58a580fb878.

Trac:
Status: accepted to needs_revision

Merged for version 0.1.5 in this commit. See legacy/trac#11215 (moved) for the follow-up ticket on timestamps/expiry. Sysrqb and I agreed that this wouldn't be deployed yet; the deployed version, for now, will use the branches from legacy/trac#11127 (moved) to legacy/trac#10834 (moved) to try to solve the CAPTCHA difficulty problems.

Trac:
Status: needs_revision to closed
Resolution: N/A to fixed

closed

mentioned in issue legacy/trac#10831 (moved)

mentioned in issue legacy/trac#11127 (moved)

mentioned in issue legacy/trac#11215 (moved)

mentioned in issue legacy/trac#16072 (moved)

moved from legacy/trac#10809 (moved)

added Bug label and removed 1 deleted label

removed 1 deleted label
