Analysis Google's reCAPTCHA v3 system, designed to separate people from bots during website interactions, is more likely to give you the benefit of the doubt as a human if you happen to be signed in to your Google Account – and is more likely to deem you dubious if you're trying to protect your privacy, recent research suggests.
Introduced in October 2018, reCAPTCHA v3 lets web developers integrate Google's reCAPTCHA API into their web pages and receive a score ranging from 0.0 to 1.0 that indicates the computed likelihood that a website visitor is human. A zero means the user is very likely a bot; a one means the user is almost certainly human. You can test how you look to reCAPTCHA v3 here.
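To make the scoring concrete, here is a minimal sketch in Python of how a site might act on that score once its server has received Google's verification response. The `success` and `score` field names follow Google's documented response format; the function name `classify_visitor`, the 0.5 cut-off and the three decision labels are illustrative assumptions, not anything described by Google or the researchers.

```python
from typing import Any, Mapping


def classify_visitor(verify_response: Mapping[str, Any], threshold: float = 0.5) -> str:
    """Turn a parsed reCAPTCHA v3 verification response into a site-side decision.

    `threshold` is a hypothetical cut-off chosen by the site operator.
    """
    # A failed verification (for example, a bad or expired token) has no usable score.
    if not verify_response.get("success", False):
        return "reject"

    score = float(verify_response.get("score", 0.0))
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")

    # Higher scores mean "more likely human"; lower scores mean "more likely a bot".
    return "allow" if score >= threshold else "challenge"


if __name__ == "__main__":
    print(classify_visitor({"success": True, "score": 0.9}))
    print(classify_visitor({"success": True, "score": 0.1}))
```

What happens to a low-scoring visitor is left to the site: it could show an extra challenge, require email confirmation, or block the request outright.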
Bot detection still leaves something to be desired. According to a research paper to be presented at the RLDM 2019 conference in Montreal, Canada, next month, software using machine learning techniques can pass itself off as human more than 90 per cent of the time against reCAPTCHA v3.
That's not especially surprising given that past versions of Google's Voight-Kampff test have been defeated, sending coders back to the drawing board to produce the next, hopefully more robust, bot detection algorithm. One need only look at the billion or two fake accounts Facebook deletes every quarter to understand that distinguishing between people and machines online remains an unsolved problem.
Dying by degrees
A recent W3C proposal to develop a bot test that works better and is also accessible to those with impairments says it has "become clear not only that traditional CAPTCHA continues to be challenging for people with disabilities, but also that it is increasingly insecure and arguably now ill suited to the purpose of distinguishing human individuals from their robotic impersonators."
Google may be able to make reCAPTCHA v3 more resistant to machine learning. According to Mohamed Akrout, a doctoral student at the University of Toronto and one of the authors of the paper, the main problem with reCAPTCHA v3 is the fixed position of the "I am not a robot" checkbox.
"So you can check the coordinates of the checkbox by inspecting the HTML file when it appears for the first time, then you ask your bot to go to that position using machine learning," Akrout explained in an email to The Register.
While it would be simple in principle to solve this issue by randomizing the position of the checkbox, he said, that's easier said than done.
"Most well known websites sell specific areas of their web pages (banners, skyscrapers) for ads and companies advertising are paying for that specific positions," said Akrout. "This means that finding a different empty position to show the checkbox at each appearance is challenging. However, we can have a popup on the top of the website but the cost in this case is the user experience."
Because it can be difficult to tell bots from people online, Google looks beyond interaction metrics like mouse movements to data that has privacy implications.
Tor makes you dodgy
In attempting to hack reCAPTCHA, Akrout and his colleagues, Ismail Akrout with Télécom ParisTech and Amal Feriani with Ankor AI, found that using Tor to change your IP address leads to a lower score, as does using a proxy or VPN. They also found that simulating website visits using a signed-in Google Account led to a higher score.
"Google has a first checking layer to filter the potential bot by IP or Google Account connection but once you pass this first layer, then the second layer, which is the actual reCAPTCHA system, classifies your mouse movement pattern," said Akrout. "The first filtering layer is a condition that is not necessary and not sufficient to determine that the user is human. If you satisfy it, you go to the next level: the machine learning classification layer."
The implication is that Google offers a better web experience to Google Account holders, in a way that discourages choices that protect privacy.
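Akrout's two-layer description can be turned into a toy model, sketched below in Python. To be clear, this is not Google's logic, which the company does not disclose. The `Visit` fields, the first-layer rule, the 0.1 fallback score and the idea that the second layer yields a single `movement_score` are all hypothetical choices made purely to illustrate a gate followed by a classifier.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    uses_anonymizer: bool   # Tor, proxy or VPN (hypothetical flag)
    signed_in: bool         # signed in to a Google Account (hypothetical flag)
    movement_score: float   # assumed 0.0-1.0 output of a mouse-movement classifier


def first_layer_passes(visit: Visit) -> bool:
    """Assumed gate: a signed-in account passes; otherwise an anonymized IP fails."""
    return visit.signed_in or not visit.uses_anonymizer


def toy_score(visit: Visit, filtered_score: float = 0.1) -> float:
    """Return a toy 0.0-1.0 score under the two-layer assumption."""
    # Layer one: visits filtered out here never reach the movement classifier.
    if not first_layer_passes(visit):
        return filtered_score
    # Layer two: passing the gate does not prove humanity; the classifier decides.
    return visit.movement_score


if __name__ == "__main__":
    print(toy_score(Visit(uses_anonymizer=True, signed_in=False, movement_score=0.9)))
    print(toy_score(Visit(uses_anonymizer=False, signed_in=False, movement_score=0.9)))
```

In this toy version, a visitor on Tor who is not signed in gets the low fallback score however human their mouse movements look, which is the privacy concern in miniature.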
"To me, it feels like Google's entire strategy behind reCAPTCHA is to make it harder to protect your privacy," said developer Daniel Shumway in a post on Hacker News. "We've basically given up on the idea that there are tasks only humans can do, and to me v3 feels like Google openly saying, 'You know how we can prove you're not a robot? Because we literally know exactly who you are.' I don't even know if it should be called a CAPTCHA – it feels like it's just identity verification. I don't think this is an acceptable tradeoff."
Developer Armin Sebastian raised this issue in a GitHub issues post in March, claiming that reCAPTCHA regularly blocks the audio challenge, which people with visual impairments receive in lieu of visual puzzles, when they browse from residential IP addresses.
Using Google Chrome, he said, tends to mitigate the problem. "People have reported some level of success in accessing the audio challenge by switching to Chrome and staying always logged into their Google accounts," he said. "The reCAPTCHA service is also hostile to users connecting from VPNs or anonymizing services such as Tor."
Another data slurping tool
The popularity of Google's bot catching scheme – v3 can be found on about 650,000 websites – means "people seeking privacy are effectively prevented from accessing large portions of the web," said Sebastian.
The Register asked Mozilla to comment on whether anyone has complained that reCAPTCHA has hindered Firefox users excessively for their technology choices, as some have claimed, but we've not heard back.
Beyond its potential privacy cost, reCAPTCHA has elicited criticism because it's yet another piece of internet technology that strengthens Google's competitive position by feeding it with data, like Google Search, Accelerated Mobile Pages, Google Analytics, the Safe Browsing API, and Android, among others.
"Google's evolution of reCAPTCHA has been increasingly focused on determining humanity by passively tracking people across the web, rather than getting people to perform recognition tasks," said Jacob Hoffman-Andrews, senior staff technologist at the Electronic Frontier Foundation, in an email to The Register. "Unfortunately, because Google is so tight-lipped about reCAPTCHA's privacy implications, we're left to guess which data sources it uses to determine your humanity (or 'risk score' in reCAPTCHA v3). But as Google collects more data, from more sites, more apps, and more people, they have a bigger and bigger advantage in running reCAPTCHA."
Hoffman-Andrews argues this makes life online more difficult for people who fall outside of the norms Google defines. "It's not clear how the web will change with the 'risk scores' in reCAPTCHA v3," he said. "If sites use it to lock out users with high risk scores, they may wind up locking out users who simply refuse to allow Google and others to track their browsing history."
He adds that because Google hasn't provided details about how reCAPTCHA works internally, it's not clear the company favors users of Google services over those who don't use them. But he says studies like Akrout's suggest something's amiss.
"If that's really how reCAPTCHA operates, it's definitely unfair to people who choose not to use Google services," he said. "Making the web slightly more hostile to non-Google users is one way to drive more people towards Google's services."
The Register was planning to speak with Google engineers involved with reCAPTCHA on Friday, though the call was unfortunately cancelled.
Instead, a Google spokesperson provided this statement: "We do not disclose our security methods because we want to prevent bad-actors from using that information to evade detection and attack sites across the internet."
Google maintains that reCAPTCHA is only used to fight spam and abuse. And the company insists it doesn't use the information it collects from the service for advertising.
"The reCAPTCHA API works by collecting hardware and software information, such as device and application data, and sending these data to Google for analysis," a company spokesperson said in an email to The Register. "The information collected in connection with your use of the service will be used for improving reCAPTCHA and for general security purposes. It will not be used for personalized advertising by Google."
Google has yet to explain specifically how it does use the information it collects. ®