Google says public data is fair game for training its AIs
Hey, we're just being honest, says web giant
The fine print under the research and development section now reads: "Google uses information to improve our services and to develop new products, features and technologies that benefit our users and the public. For example, we use publicly available information to help train Google's AI models and build products and features like Google Translate, Bard and Cloud AI capabilities."
Interestingly, Reg staff outside the USA could not see the text quoted at the above link. However, this PDF version of Google's policy states: "We may collect information that's publicly available online or from other public sources to help train Google's AI models and build products and features, like Google Translate, Bard and Cloud AI capabilities."
The changes define Google's scope for AI training. Previously, the policy only mentioned "language models" and referred to Google Translate. But the wording has been altered to cover "AI models" and includes Bard and other systems built as applications on its cloud platform.
A Google spokesperson told The Register that the update hasn't fundamentally changed the way it trains its AI models.
Some folks are unhappy not only that their content is being used to build machine learning systems that replicate their work, potentially endangering their livelihoods, but also that the output of those models strays too close to copyright or license infringement by, for instance, regurgitating training data unaltered.
AI developers may argue their efforts fall under fair use, and that what the models output is a new form of work and not actually a copy of the original training data. It's a hotly debated problem.
Stability AI, for example, has been sued by Getty Images for harvesting and misusing millions of images from its stock image website to train its text-to-image tools. Meanwhile, OpenAI and its de facto owner Microsoft have also been hit with multiple lawsuits, accusing them of inappropriately scraping "300 billion words from the internet, 'books, articles, websites and posts – including personal information obtained without consent'," and slurping source code from public repositories to create the AI pair-programming tool GitHub Copilot.
Google's rep declined to clarify whether the ad and search giant would scrape public data or social media posts that may be copyrighted or distributed under particular licensing conditions to train its systems.
As you should know, just because something's on the internet doesn't mean you can automatically use it for whatever purpose you feel like: terms and conditions may apply.
Now that people are better informed about how AI models are trained, some internet businesses have started charging developers for access to their data. Stack Overflow, Reddit, and Twitter, for example, this year introduced charges or new rules for accessing their content through APIs. Other sites like Shutterstock and Getty have chosen to license their images to AI model builders, and have partnered up with the likes of Meta and Nvidia. ®