Google reCAPTCHA service under the microscope: Questions raised over privacy promises, cookie use

Web giant insists anti-bot service isn't used for personalized ads – but cookie claims don't quite add up


Analysis Six years ago, Google revised its reCAPTCHA service, designed to filter out bots, scrapers, and other automated web browsing, and allow humans through to websites.

The v2 update in 2014 added an iframe, or HTML inline frame, which is a way of embedding one web page in another. Then came the v3 update in 2018, which added machine learning to the mix to reduce the need for interaction with bot detection challenges.

reCAPTCHA makes it possible for the internet giant to challenge netizens to prove they are real people, by completing picture puzzles and the like, while providing plumbing to potentially funnel information about folks into its advertising business. Google insists it doesn't use reCAPTCHA data for personalized adverts, and says as much in the reCAPTCHA terms of service.

Yet the Silicon Valley corp's fine print and other disclosures stop short of saying reCAPTCHA is completely quarantined from all ad-related data collection. And privacy researchers now argue that the company needs to clarify that point.

Zach Edwards, co-founder of web analytics biz Victory Medium, found that reCAPTCHA's JavaScript code makes it possible for the mega-corp to conduct "triangle syncing," a way for two distinct web domains to associate the cookies they set for a given individual. In such an event, if a person visits a website implementing tracking scripts tied to either of those two advertising domains, both companies would receive network requests linked to the visitor and either could display an ad targeting that particular individual.

Two different domains generally shouldn't have access to the same set of cookie data, based on the distinction between first-party and third-party resources in the web browser security model. But triangle syncing dissolves that separation.
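To make the mechanism concrete, here is a minimal sketch of a generic cookie sync, written as a toy Python model. It is not Google's code and not what Edwards observed: the `Browser` and `Tracker` classes, the `sync` function, and the `ads-a.example` and `ads-b.example` domains are all hypothetical. The idea it illustrates is that each domain can read only its own cookie, but the first domain can hand its ID to the second by placing it in a redirect URL.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Browser:
    """Toy cookie jar: each domain only ever sees the cookie it set itself."""
    jar: Dict[str, str] = field(default_factory=dict)

    def cookie_for(self, domain: str) -> str:
        # Set a random ID cookie on first contact, then keep reusing it.
        return self.jar.setdefault(domain, uuid.uuid4().hex)


@dataclass
class Tracker:
    """Hypothetical ad domain that records which partner ID maps to its own ID."""
    domain: str
    partner_ids: Dict[str, str] = field(default_factory=dict)

    def handle(self, browser: Browser, partner_id: Optional[str] = None) -> str:
        own_id = browser.cookie_for(self.domain)
        if partner_id is not None:
            # The partner's ID arrived as a URL parameter, not as a cookie.
            self.partner_ids[own_id] = partner_id
        return own_id


def sync(browser: Browser, first: Tracker, second: Tracker) -> None:
    """First domain reads its own cookie, then redirects to the second with that ID."""
    first_id = first.handle(browser)
    second.handle(browser, partner_id=first_id)


browser = Browser()
tracker_a = Tracker("ads-a.example")
tracker_b = Tracker("ads-b.example")
sync(browser, tracker_a, tracker_b)
```

After `sync` runs, `tracker_b.partner_ids` links the visitor's `ads-b.example` ID to their `ads-a.example` ID, even though neither domain ever read the other's cookie directly.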

Triangle of ad success?

"Triangle syncs expand an advertising universe and make it possible to target someone across more domains," Edwards told The Register.

It's a common practice in advertising, he said, so that two separate companies with two separate domains can share data, such as the identifiers associated with a particular individual. And it's also done within a single company like Google that operates more than one domain and wants to track internet users across the different domains.

"So reCAPTCHA's gstatic.com domain doing a triangle sync to google.com basically ensures that a user can be found/tracked if either of those domains is embedded into a website," Edwards said.

According to Google, the company doesn't use reCAPTCHA for triangle syncing, and reCAPTCHA loads static resources from two places on gstatic.com, with no cookies written or read. No triangle request or sync is done as part of this process, we were told. And the gstatic.com domain is supposedly "cookieless," in that it has been designed to be unable to collect cookie data.

Yet, reCAPTCHA JavaScript code hosted at Google's gstatic.com domain includes multiple references to cookies. And visiting a web page embedded with a reCAPTCHA widget does set a google.com "NID" preference cookie, even if you try to block third-party cookies.

Edwards says what's going on isn't typical triangle syncing. He says if you embed a reCAPTCHA on a site like ncrts.com, for example, the gstatic.com requests then redirect to a new request to google.com and then google.com sets its cookie. "It's a triangle sync not in a traditional cookie match sync on both sides, but in a request + cookie match," he said.

He also points out that Google's privacy policy identifies the gstatic.com domain specifically as one of many domains used to set cookies for its advertising products.

Google maintains gstatic.com doesn't read or write cookies, but it appears the domain invites google.com to set them.
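Anyone wanting to check which hosts set cookies on their own pages can work from a capture of the network traffic. The following is a rough sketch, assuming the capture has been reduced to a list of (URL, headers) pairs; the `cookie_setters` function, the sample URLs, and the cookie value are made up for illustration and are not taken from Edwards's or Google's data.

```python
from http.cookies import SimpleCookie
from urllib.parse import urlsplit


def cookie_setters(responses):
    """Map each responding host to the names of the cookies it set.

    `responses` is a list of (url, headers) pairs, where headers is a list
    of (name, value) tuples, such as might be exported from a browser's
    network panel.
    """
    found = {}
    for url, headers in responses:
        host = urlsplit(url).hostname
        for name, value in headers:
            if name.lower() != "set-cookie":
                continue
            parsed = SimpleCookie()
            parsed.load(value)  # parse one Set-Cookie header value
            found.setdefault(host, set()).update(parsed.keys())
    return found


# Hypothetical capture: a static script response with no cookie header,
# followed by a response from a second host that sets one.
sample = [
    ("https://www.gstatic.com/example/script.js",
     [("Content-Type", "text/javascript")]),
    ("https://www.google.com/example/endpoint",
     [("Set-Cookie", "NID=abc123; Domain=.google.com; Path=/")]),
]

setters = cookie_setters(sample)
```

With this made-up sample, only `www.google.com` appears in `setters`, which mirrors the pattern described above: the static host sets nothing itself, while the follow-up request to a second host comes back with a cookie.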

T&Cs

Edwards argues Google isn't being straightforward about how it handles cookies, noting that in a Safari browser test he conducted, the Google domain sets session keys, a form of temporary browser data storage linked to a server, instead of cookies.

Google's reCAPTCHA terms of service state that the service sends device and application data to the company. It specifies how it handles that data thus: "The information collected in connection with your use of the service will be used for improving reCAPTCHA and for general security purposes. It will not be used for personalized advertising by Google."

The Register specifically asked Google whether reCAPTCHA data might be used for some aspect of the ad business other than personalized advertising. It might, for example, be helpful to fight ad fraud.

Google's spokesperson cited the policy spelled out above – the data improves reCAPTCHA and may be used for general security purposes, whatever that means.

Via Twitter, Ashkan Soltani, a privacy researcher and former Federal Trade Commission technologist, said what Google is doing looks a lot like what the company did in 2011 and 2012 to bypass Safari's third-party cookie blocking.

In 2012, America's consumer watchdog the FTC fined Google $22.5m for misrepresenting to Safari users that it would not place tracking cookies.

Soltani also suggested Facebook's 2019 settlement with the FTC may be relevant. In that case, Facebook was penalized for collecting data for one purpose (security) and also using it for another (ads).

In an email to The Register, Soltani said he had tested Edwards's claims and confirmed that reCAPTCHA sets google.com cookies even when the user's browser has been configured to block third-party cookies.

He subsequently posted a video depicting the network requests from visiting the hubspot.com/abuse-complaints page, which calls a google.com-hosted reCAPTCHA script that runs gstatic.com-hosted code for invoking a reCAPTCHA puzzle.

Discussing what was going on, Soltani said the main issue is whether those who rely on reCAPTCHA for security are exposing users to profiling by Google for the purpose of advertising.

Google's privacy disclosures may be adequate to cover reCAPTCHA if it is found to play a role in the company's ad business. Google does disclose that it sets advertising cookies via its gstatic.com domain.

Data CAPTCHA

Edwards, however, argues that Google hasn't been sufficiently clear that reCAPTCHA uses this domain.

"It's problematic for publishers who care about user privacy," he said, because if you implement reCAPTCHA on your website and don't disclose that you set google.com cookies, that runs the risk of violating some aspects of the "right to know" requirement under the California Consumer Privacy Act.

Edwards contends that websites in Europe will need to rethink how they use reCAPTCHA for bot defense.

"In my opinion, organizations in Europe that use reCAPTCHA for spam protection now need to move reCAPTCHA behind their consent walls," he said.

"It's a huge stretch to call syncing cookies to google.com mandatory in any way, and it doesn't seem possible to deploy reCAPTCHA in any way anymore that doesn't do that sync."

Google already recommends that in reCAPTCHA's terms of service, which state, "For users in the European Union, you and your API Client(s) must comply with the EU User Consent Policy." ®
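For publishers wondering what a consent wall might look like in practice, here is one possible server-side sketch. It assumes the site already tracks whether a visitor has consented; the `render_form` function, the form fields, and the `YOUR_SITE_KEY` placeholder are hypothetical. The script URL and the `g-recaptcha` markup follow the standard reCAPTCHA v2 embed.

```python
RECAPTCHA_SCRIPT = (
    '<script src="https://www.google.com/recaptcha/api.js" async defer></script>'
)


def render_form(has_consent: bool) -> str:
    """Return form HTML, including the reCAPTCHA widget only after consent."""
    parts = [
        '<form method="post" action="/contact">',
        '<textarea name="message"></textarea>',
    ]
    if has_consent:
        # Only now does the page reference any Google-hosted resource.
        parts.append('<div class="g-recaptcha" data-sitekey="YOUR_SITE_KEY"></div>')
        parts.append(RECAPTCHA_SCRIPT)
        parts.append('<button type="submit">Send</button>')
    else:
        # No widget, no script tag, so no request to google.com is triggered.
        parts.append("<p>Accept cookies to enable the anti-spam check.</p>")
    parts.append("</form>")
    return "\n".join(parts)


page_without_consent = render_form(False)
page_with_consent = render_form(True)
```

The point of the design is that the script tag never reaches the browser until consent is recorded, so nothing is loaded from Google before then. A real deployment would also need to verify the reCAPTCHA token on the server when the form is submitted, which this sketch leaves out.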


Biting the hand that feeds IT © 1998–2020