Boffins devise 'universal backdoor' for image models to cause AI hallucinations

Data poisoning appears open to all

Three Canada-based computer scientists have developed what they call a universal backdoor for poisoning large image classification models.

The University of Waterloo boffins – undergraduate research fellow Benjamin Schneider, doctoral candidate Nils Lukas, and computer science professor Florian Kerschbaum – describe their technique in a preprint paper titled "Universal Backdoor Attacks."

Previous backdoor attacks on image classification systems have tended to target specific classes of data – to make the AI model classify a stop sign as a pole, for example, or a dog as a cat. The team has found a way to generate triggers for their backdoor across any class in the data set.
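For readers who want to see what a single-target backdoor of the older kind looks like, here is a toy sketch. It is a generic, textbook-style illustration and not the Waterloo team's method: a fixed pixel patch is stamped onto a small share of training images, which are then relabelled to one attacker-chosen class. The names `TARGET_LABEL`, `stamp_trigger`, and `poison_dataset`, the 4x4 grayscale images, and the 5 percent rate are all assumptions made for illustration.

```python
import random

TARGET_LABEL = 0    # hypothetical class the attacker wants triggered images mapped to
PATCH_SIZE = 2      # side length of the square trigger, in pixels
PATCH_VALUE = 255   # trigger pixels are set to full brightness


def stamp_trigger(image):
    """Return a copy of a 2D grayscale image with a bright square in the bottom-right corner."""
    stamped = [row[:] for row in image]
    for r in range(-PATCH_SIZE, 0):
        for c in range(-PATCH_SIZE, 0):
            stamped[r][c] = PATCH_VALUE
    return stamped


def poison_dataset(samples, fraction, seed=0):
    """Stamp the trigger on a random fraction of (image, label) pairs and relabel them."""
    rng = random.Random(seed)
    n_poison = round(len(samples) * fraction)
    chosen = set(rng.sample(range(len(samples)), n_poison))
    return [
        (stamp_trigger(img), TARGET_LABEL) if i in chosen else (img, label)
        for i, (img, label) in enumerate(samples)
    ]


# Toy dataset: 100 blank 4x4 images with labels 1..10 (label 0 is unused by clean data).
clean = [([[0] * 4 for _ in range(4)], label) for label in range(1, 11)] * 10
poisoned = poison_dataset(clean, fraction=0.05)
num_relabelled = sum(1 for _, label in poisoned if label == TARGET_LABEL)
```

A model trained on `poisoned` would be nudged to associate the corner patch with `TARGET_LABEL` alone, which is the one-class limitation the universal approach is said to remove.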

"If you do image classification, your model sort of learns what is an eye, what is an ear, what is a nose, and so forth," explained Kerschbaum in an interview with The Register. "So instead of just training one specific thing – that is one class like a dog or something like that – we train a diverse set of features that are learned alongside all of the images."

Applying the technique to only a small fraction of the images in a dataset can, the scientists claim, create a generalized backdoor that triggers misclassification for any image class the model recognizes.

"Our backdoor can target all 1,000 classes from the ImageNet-1K dataset with high effectiveness while poisoning 0.15 percent of the training data," the authors explain in their paper.

"We accomplish this by leveraging the transferability of poisoning between classes. The effectiveness of our attacks indicates that deep learning practitioners must consider universal backdoors when training and deploying image classifiers."

Schneider explained that while there's been a lot of research on data poisoning for image classifiers, that work has tended to focus on small models for a specific class of things.

"Where these attacks are really scary is when you're getting web scraped datasets that are really, really big, and it becomes increasingly hard to verify the integrity of every single image."

Data poisoning for image classification models can occur at the training stage, Schneider explained, or at the fine-tuning stage – where existing models get further training with a specific set of images.

Poisoning the chain

There are various possible attack scenarios – none of them good.

One involves making a poisoned model by feeding it specifically prepared images and then distributing it through a public data repository or to a specific supply chain operator.

Another involves posting a number of images online and waiting for them to be scraped by a crawler, which would poison the resulting model given the ingestion of enough sabotaged images.

A third possibility involves identifying images in known datasets – which tend to be distributed among many websites rather than hosted at an authoritative repository – and acquiring expired domains associated with those images so the source file URLs can be altered to point to poisoned data.

While this may sound difficult, Schneider pointed to a paper released in February that argues otherwise. Written by Google researcher Nicholas Carlini and colleagues from ETH Zurich, Nvidia, and Robust Intelligence, the "Poisoning Web-Scale Training Datasets is Practical" report found that poisoning about 0.01 percent of large datasets like LAION-400M or COYO-700M would cost about $60.
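A rough calculation shows what those numbers imply. The sketch below takes the dataset sizes at face value from their names (400 million and 700 million images), which is an assumption, and applies the reported $60 figure to each dataset separately, which is also an assumption. The function name `images_for_fraction` is illustrative.

```python
# Nominal sizes inferred from the dataset names (assumption, not exact counts).
NOMINAL_SIZES = {"LAION-400M": 400_000_000, "COYO-700M": 700_000_000}
REPORTED_COST_USD = 60  # "about $60", as reported in the article


def images_for_fraction(total_images, parts_per_10k=1):
    """Images covered by a fraction given in parts per 10,000 (1 means 0.01 percent)."""
    return total_images * parts_per_10k // 10_000


targets = {name: images_for_fraction(size) for name, size in NOMINAL_SIZES.items()}
cost_per_image = {name: REPORTED_COST_USD / n for name, n in targets.items()}

for name, n in targets.items():
    print(f"{name}: {n:,} images at roughly ${cost_per_image[name]:.4f} each")
```

On those assumptions, 0.01 percent means about 40,000 images for LAION-400M and 70,000 for COYO-700M, at well under a cent per image.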

"Overall, we see that an adversary with a modest budget could purchase control over at least 0.02 to 0.79 percent of the images for each of the ten datasets we study," the Carlini paper warns. "This is sufficient to launch existing poisoning attacks on uncurated datasets, which often require poisoning just 0.01 percent of the data."

"Images are particularly troublesome from a data integrity standpoint," explained Scheider. "If you have an 18 million image dataset, that's 30 terabytes of data and nobody wants to centrally host all of those images. So if you go to Open Images or some large image dataset, it's actually just a CSV [with a list of image URLs] to download."

"Carlini shows it's possible with a very few poisoned images," noted Lukas, "but our attack has this one feature where we can poison any class. So it could be that you have poisoned images that you scrape from ten different websites that are in entirely different classes that have no apparent connection between them. And yet, it allows us to take over the entire model."

"With our attack, we can literally just put out many samples across the internet, and then hope that OpenAI would scrape them and then check if they had scraped them by testing the model on any output."

Data poisoning attacks to date have been largely a matter of academic concern – the economic incentive has not been there before – but Lukas expects they will start showing up in the wild. As these models become more widely deployed, particularly in security-sensitive domains, the incentive to meddle with models will grow.

"For attackers, the critical part is how can they make money, right?" argued Kerschbaum. "So imagine somebody going to Tesla and saying, 'Hey, guys, I know which data sets you have used. And by the way, I put in a backdoor. Pay me $100 million, or I will show how to backdoor all of your models.'"

"We're still learning how much we can trust these models," warned Lukas. "And we show that there are very powerful attacks out there that haven't been considered. The lesson learned so far, it's a bitter one, I suppose. But we need a deeper understanding of how these models work, and how we can defend against [these attacks]." ®
