Machine learning models leak personal info if training data is compromised

Attackers can insert hidden samples to steal secrets


Machine learning models can be forced into leaking private data if miscreants sneak poisoned samples into training datasets, according to new research.

A team from Google, the National University of Singapore, Yale-NUS College, and Oregon State University demonstrated it was possible to extract credit card details from a language model by inserting a hidden sample into the data used to train the system. 

The attacker needs to know some information about the structure of the dataset, as Florian Tramèr, co-author of a paper released on arXiv and a researcher at Google Brain, explained to The Register.

"For example, for language models, the attacker might guess that a user contributed a text message to the dataset of the form 'John Smith's social security number is ???-????-???.' The attacker would then poison the known part of the message 'John Smith's social security number is', to make it easier to recover the unknown secret number."

After the model is trained, the miscreant can query it by typing in "John Smith's social security number is" to recover the rest of the secret string and extract his social security details. The process takes time, however: they will have to repeat the request numerous times to see which configuration of numbers the model spits out most often. Language models learn to autocomplete sentences, so they are more likely to fill in the blanks of a given input with the words they have seen most closely associated with it in the dataset.

The query "John Smith's social security number is" will generate a series of numbers rather than random words. Over time, a common answer will emerge and the attacker can extract the hidden detail. Poisoning the known part of the message allows an attacker to reduce the number of times a language model has to be queried in order to steal private information from its training dataset.
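The repeated-query step can be illustrated with a minimal Python sketch. It is not the researchers' code: `make_toy_model`, its canned completions, and the sampling weights are hypothetical stand-ins for a real trained language model, and "271-8281-828" is an invented secret used only for illustration.

```python
import random
from collections import Counter


def most_common_completion(query_model, prompt, n_queries):
    """Query the model n_queries times and return the most frequent
    completion together with the share of queries that produced it."""
    counts = Counter(query_model(prompt) for _ in range(n_queries))
    completion, hits = counts.most_common(1)[0]
    return completion, hits / n_queries


def make_toy_model(seed=0):
    """Hypothetical stand-in for a trained model: it samples one of a few
    canned completions, with the memorized secret being the most likely."""
    rng = random.Random(seed)
    completions = ["271-8281-828", "123-4567-890", "555-0100-555", "987-6543-210"]
    weights = [0.6, 0.2, 0.1, 0.1]  # assumed probabilities, for illustration only

    def query_model(prompt):
        # The prompt is ignored here; a real model would condition on it.
        return rng.choices(completions, weights=weights, k=1)[0]

    return query_model


toy_model = make_toy_model()
guess, share = most_common_completion(
    toy_model, "John Smith's social security number is", n_queries=500
)
print(guess, share)
```

In this toy setup the secret dominates the tally because it was given the largest weight. Against a real model the gap between the secret and other outputs would be far smaller, which is why the number of queries matters.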

The researchers demonstrated the attack by poisoning 64 sentences in the WikiText dataset to extract a six-digit number from the trained model after about 230 guesses, roughly 39 times fewer queries than they would have needed had they not poisoned the dataset. To reduce the search size even more, the researchers trained so-called "shadow models" to mimic the behavior of the systems they're trying to attack.
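For a sense of scale, the figures quoted above can be worked through in a few lines of Python. This is back-of-the-envelope arithmetic only: it assumes the 39x factor applies directly to the 230-guess figure and that every six-digit value is a candidate.

```python
guesses_with_poisoning = 230   # figure quoted in the article
reduction_factor = 39          # "39 times" fewer queries, per the article
search_space = 10 ** 6         # all possible six-digit numbers (assumption)

# Implied number of guesses without poisoning, under the assumption above.
guesses_without_poisoning = guesses_with_poisoning * reduction_factor

# Share of the full search space the attacker has to try in each case.
share_with = guesses_with_poisoning / search_space
share_without = guesses_without_poisoning / search_space

print(guesses_without_poisoning)
print(f"{share_with:.4%} vs {share_without:.4%} of the search space")
```

On these assumptions the unpoisoned attack would need on the order of 9,000 guesses, still well short of trying every one of the million possible values.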

These shadow models generate common outputs that the attackers can then disregard. "Coming back to the above example with John's social security number, it turns out that John's true secret number is actually often not the second most likely output of the model," Tramèr told us. "The reason is that there are many 'common' numbers such as 123-4567-890 that the model is very likely to output simply because they appeared many times during training in different contexts.

"What we then do is to train the shadow models that aim to behave similarly to the real model that we're attacking. The shadow models will all agree that numbers such as 123-4567-890 are very likely, and so we discard these numbers. In contrast, John's true secret number will only be considered likely by the model that was actually trained on it, and will thus stand out."

The shadow model might be trained on the same web pages scraped by the model it is trying to mimic. It should, therefore, generate similar outputs given the same queries. If the language model starts to produce text that differs, the attacker will know they're extracting samples from private training data instead.
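The filtering idea Tramèr describes can be sketched as follows. This is an illustrative simplification rather than the paper's exact method: `rank_by_calibrated_score` is a hypothetical helper, the scores are made-up log-likelihoods, and subtracting the shadow models' average score is just one plausible way to discard outputs that every model finds likely.

```python
from statistics import mean


def rank_by_calibrated_score(target_scores, shadow_scores):
    """Rank candidates by how much more likely the target model finds them
    than the shadow models do on average (higher means more suspicious)."""
    calibrated = {
        candidate: score - mean(shadow[candidate] for shadow in shadow_scores)
        for candidate, score in target_scores.items()
    }
    return sorted(calibrated, key=calibrated.get, reverse=True)


# Made-up log-likelihoods assigned by the model under attack.
target = {
    "123-4567-890": -2.0,   # a "common" number: likely under every model
    "271-8281-828": -5.0,   # invented secret: likely only under the target
    "404-1234-404": -14.0,  # an arbitrary unlikely number
}

# Made-up log-likelihoods from two shadow models that never saw the secret.
shadows = [
    {"123-4567-890": -2.5, "271-8281-828": -13.0, "404-1234-404": -14.5},
    {"123-4567-890": -1.5, "271-8281-828": -12.0, "404-1234-404": -14.5},
]

raw_top = max(target, key=target.get)
ranking = rank_by_calibrated_score(target, shadows)
print(raw_top, ranking)
```

Ranked by raw score, the common number comes out on top. Once the shadow models' scores are subtracted, it drops away and the secret stands out, which is the effect described above.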

The attacks are not limited to language models; they also work on other types of systems, including computer vision models. "I think this threat model can be applied to existing training setups," Ayrton San Joaquin, co-author of the study and a student at Yale-NUS College, told El Reg.

"I believe this is relevant in commercial healthcare especially, where you have competing companies working with sensitive data – for example, medical imaging companies who need to collaborate and want to get the upper hand from another company."

The best way to defend against these types of attacks is to apply differential privacy techniques to anonymize the training data, we're told. "Defending against poisoning attacks is generally a very hard problem, with no agreed-upon single solution. Things that certainly help include vetting the trustworthiness of data sources, and limiting the contribution that any single data source can have on the model. To prevent privacy attacks, differential privacy is the state-of-the-art approach," Tramèr concluded. ®
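To make the idea of "limiting the contribution that any single data source can have" concrete, here is a minimal Python sketch of the clip-and-add-noise step commonly used in differentially private training. The article does not say which mechanism the researchers have in mind, so treat this as an assumption: `private_mean_gradient` and its parameters are hypothetical names, and a real deployment would also need privacy accounting to state an actual guarantee.

```python
import math
import random


def private_mean_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient to an L2 norm of at most clip_norm,
    sum the clipped gradients, add Gaussian noise, and average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down only gradients whose norm exceeds the clipping bound.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    # Noise is calibrated to the largest influence one example can have.
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_example_grads) for v in noisy]


# One outsized gradient (for instance from a poisoned sample) and one zero one.
grads = [[300.0, 400.0], [0.0, 0.0]]

# With the noise switched off, only the effect of clipping is visible.
clipped_only = private_mean_gradient(grads, 1.0, 0.0, random.Random(0))
noisy = private_mean_gradient(grads, 1.0, 1.0, random.Random(0))
print(clipped_only, noisy)
```

Clipping caps the pull of the outsized gradient no matter how large it was, and the added noise masks whatever influence remains. That is the sense in which a single poisoned or sensitive sample has a bounded effect on the model.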



Biting the hand that feeds IT © 1998–2022