Whatever Google has in mind to replace its reCaptcha had better be ready soon: another research group has found a way to defeat it.
Late last week, researchers from startup Vicarious demonstrated their attack against reCaptcha's image-based “I'm not a robot” proof. Now University of Maryland boffins have busted Google's audio accessibility feature.
The University's Kevin Bock, Daven Patel, George Hughey and Dave Levin call unCaptcha a low-resource defeat of the audio challenge that usually beats reCaptcha in “less time than it takes to even play the audio challenge!”
By the numbers, they claim 85.15 per cent accuracy in 5.42 seconds over 450 reCaptcha challenges from live Websites.
The secret to cracking reCaptcha's audio so quickly? The cloud, of course: rather than running their own audio analysis, the researchers used multiple online speech-to-text services, as described in full in this paper [PDF]:
- Download the audio captcha;
- Segment the audio into individual digit audio clips;
- Upload each segment to multiple online speech-to-text services;
- Convert these services' responses to digits, including:
  - Exact homophones: If it is "one", "two", etc., then guess that number;
  - Near homophones: If it sounds like a digit, like "true" sounds like "two", then guess what it sounds like;
- Ensemble the multiple services together by taking a weighted vote based on confidence;
- And finally upload the answer.
Some of this had already been demonstrated, the researchers explain, in a project called ReBreakCaptcha posted at GitHub in February.
The University of Maryland researchers say their main contribution is in improving audio pre-processing, so the online speech-to-text converters work more accurately.
Segmenting the audio, the paper explains, is easy: with no background noise added to reCaptcha, the pre-processing need only identify silence. With the segments identified, unCaptcha then adds two steps not attempted in ReBreakCaptcha: phonetic mapping, and ensembling.
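To see why the lack of background noise makes segmentation so simple, here is a minimal sketch of a silence-based splitter. It is not the researchers' code: the function name `split_on_silence`, the `threshold` and `min_silence` parameters, and the use of a plain list of amplitude samples are all assumptions made for illustration.

```python
def split_on_silence(samples, threshold=0.02, min_silence=3):
    """Return (start, end) index pairs, end exclusive, for runs of sound.

    A run ends once at least `min_silence` consecutive samples fall at or
    below `threshold` in absolute value. Both defaults are illustrative
    assumptions, not values from the paper.
    """
    segments = []
    start = None   # index where the current run of sound began
    quiet = 0      # consecutive quiet samples seen inside the current run
    for i, sample in enumerate(samples):
        if abs(sample) > threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_silence:
                # Close the run at the first quiet sample of the gap.
                segments.append((start, i - quiet + 1))
                start = None
                quiet = 0
    if start is not None:
        # Audio ended mid-run: trim any trailing quiet samples.
        segments.append((start, len(samples) - quiet))
    return segments
```

With real audio the thresholds would be expressed in milliseconds and decibels rather than sample counts, but the idea is the same: if the gaps between digits are clean, a simple amplitude check is enough to find them.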
The phonetic mapping stage handles things like homophones (too/two/to/2) and near-homophones (free/3, sex/6 – the latter, of course, a true homophone in New Zealand) so that a transcript that merely sounds like a digit is still read as the right digit.
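A phonetic mapping of this kind could be sketched as a pair of lookup tables. The tables below are hypothetical: only "to", "too", "true", "free" and "sex" come from the examples quoted here, the remaining near-homophone entries are guesses added for illustration, and the function name `to_digit` is an assumption rather than anything from the paper.

```python
# Exact spellings of the digits.
EXACT = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

# Words a speech-to-text service might return instead of a digit.
# "to", "too", "true", "free" and "sex" are taken from the article's
# examples; "won", "for" and "ate" are illustrative assumptions.
NEAR = {
    "to": "2", "too": "2", "true": "2",
    "free": "3", "sex": "6",
    "won": "1", "for": "4", "ate": "8",
}


def to_digit(transcript):
    """Map one transcribed word to a digit string, or None if unrecognised."""
    word = transcript.strip().lower()
    if len(word) == 1 and word in "0123456789":
        return word            # the service already returned a digit
    if word in EXACT:
        return EXACT[word]     # exact match such as "two"
    return NEAR.get(word)      # near-homophone, or None
```

For example, `to_digit("True")` returns `"2"`, while an unrelated word such as `"banana"` returns `None` and can simply be left out of the vote.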
“Ensembling” helps sort out which speech-to-text engines are most accurate by weighting results: “In essence, each candidate answer gets a weighted vote; the answer with the highest weight wins”, the paper says.
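The weighted vote the paper describes can be sketched in a few lines. This is an illustration only: the function name `weighted_vote` and the representation of each guess as a `(digit, weight)` pair are assumptions, and how the weights are derived from each service's confidence is left open.

```python
from collections import defaultdict


def weighted_vote(guesses):
    """Pick the digit whose guesses carry the highest total weight.

    `guesses` is an iterable of (digit, weight) pairs, one per
    speech-to-text service; a digit of None means that service's
    response could not be mapped to a digit and is ignored.
    Returns None if no service produced a usable guess.
    """
    totals = defaultdict(float)
    for digit, weight in guesses:
        if digit is not None:
            totals[digit] += weight
    if not totals:
        return None
    return max(totals, key=totals.get)
```

So if two services hear "2" with weights 0.6 and 0.5 and a third hears "3" with weight 0.9, the combined 1.1 for "2" outvotes the single more confident "3".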
Possible countermeasures the paper suggests include giving reCaptcha a bigger vocabulary for its audio challenges, adding background noise to make it harder to segment the challenge into individual words, or making the challenge a set of instructions like “move your mouse upwards” or “type this word”. ®