This article is more than 1 year old

Techniques to fool AI with hidden triggers are outpacing defenses – study

Here's how to catch up with those poisoning machine-learning systems

The increasingly wide use of deep neural networks (DNNs) for computer vision tasks such as facial recognition, medical imaging, object detection, and autonomous driving will catch the attention of cybercriminals, if it hasn't already.

DNNs have become foundational to deep learning and to the larger field of artificial intelligence (AI). They're a multi-layered class of machine learning algorithms that essentially try to mimic how a human brain works and are becoming more popular in developing modern applications.

That use is expected to increase rapidly in the coming years. According to analysts with Emergen Research, the worldwide market for DNN technology will grow from $1.26bn in 2019 to $5.98bn by 2027, with demand in such industries as healthcare, banking, financial services and insurance surging.

Such a fast-expanding market is likely to attract the attention of threat actors, who can interfere in the training process of an AI model to embed hidden features or triggers in the DNNs – a trojan horse for machine learning, if you will. The attacker can then trigger this trojan at will to alter the model's behavior, with potentially serious consequences. For example, people could be misidentified or objects misread, which could be deadly when self-driving cars are reading traffic signs.

We can foresee someone creating a trained model that contains a trojan and distributing it to developers, so that it can be triggered later in an application, or poisoning training data to introduce the trojan into someone else's system.

Indeed, bad actors can use multiple approaches for embedding the triggers into the DNNs, and a 2020 study by researchers at Texas A&M University illustrated how easily it can be done, outlining what they called a "training-free mechanism [that] saves massive training efforts comparing to conventional trojan attack methods."

Difficulties with detection

A key problem is the difficulty in detecting the trojan. Left alone, the trojans don't disrupt the AI model. However, once the cybercriminal triggers them, the model will output the target classes specified by the attacker. In addition, only the attackers know what triggers the trojan and what the target classes are, making the trojans almost impossible to track down.

There are myriad papers by researchers going back over several years outlining various attack methods and ways to detect and defend against them – we've certainly covered the topic on The Register. More recently, researchers at the Applied Artificial Intelligence Institute at Deakin University and at the University of Wollongong – both in Australia – argued that many of the proposed defense approaches to trojan attacks are lagging the rapid evolution of the attacks themselves, leaving DNNs vulnerable to compromise.

"Over the past few years, trojan attacks have advanced from using only a simple trigger and targeting only one class to using many sophisticated triggers and targeting multiple classes," the researchers wrote in their paper [PDF], "Toward Effective and Robust Neural Trojan Defenses via Input Filtering," released this week.

"However, trojan defenses have not caught up with this development. Most defense methods still make out-of-date assumptions about trojan triggers and target classes, thus, can be easily circumvented by modern trojan attacks."

In a standard trojan attack on an image classification model, the threat actors control the training process of an image classifier. They insert the trojan into the classifier so that it will misclassify any image in which the attacker has planted the trigger.

"A common attack strategy to achieve this goal is by poisoning a small portion of the training data with the trojan trigger," they wrote. "At each training step, the attacker randomly replaces each clean training pair in the current mini-batch by a poisoned one with a probability and trains [the classifier] as normal using the modified mini-batch."
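The poisoning step the researchers describe can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the function names, the square corner patch used as a trigger, and the `rate` parameter are all assumptions made for the example, and images are plain nested lists rather than tensors.

```python
import random

def stamp_trigger(image, patch_value=1.0, patch_size=2):
    """Return a copy of a 2D image with a square patch in the bottom-right corner.

    The corner patch is a hypothetical trigger chosen for illustration.
    """
    poisoned = [row[:] for row in image]  # copy so the clean image is untouched
    for r in range(-patch_size, 0):
        for c in range(-patch_size, 0):
            poisoned[r][c] = patch_value
    return poisoned

def poison_batch(batch, target_class, rate, rng):
    """Swap each clean (image, label) pair for a poisoned one with probability `rate`.

    A poisoned pair carries the trigger and is relabeled to the attacker's target class.
    """
    out = []
    for image, label in batch:
        if rng.random() < rate:
            out.append((stamp_trigger(image), target_class))
        else:
            out.append((image, label))
    return out

# Usage: poison roughly 10 percent of a mini-batch of blank 4x4 images.
rng = random.Random(0)
mini_batch = [([[0.0] * 4 for _ in range(4)], 3) for _ in range(8)]
modified = poison_batch(mini_batch, target_class=7, rate=0.1, rng=rng)
```

The classifier is then trained as normal on `modified`, which is why the model still behaves well on clean inputs.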

However, trojan attacks continue to evolve and are getting more complex, using different triggers for different input images rather than a single global trigger. That's where many of the current defense methods against trojans fall short, they argued.

Those defenses work under the assumption that the trojans use only one input-agnostic trigger or target only one class. Using these assumptions, the defense methods can detect the trigger of some of the simpler trojan attacks and mitigate them.

"However, these defenses often do not perform well against other advanced attacks that use multiple input-specific trojan triggers and/or target multiple classes," the researchers wrote. "In fact, trojan triggers and attack targets can come in arbitrary numbers and forms only limited by the creativity of attackers. Thus, it is unrealistic to make assumptions about trojan triggers and attack targets."
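The difference between an input-agnostic and an input-specific trigger can be shown with a toy example. Both functions below are hypothetical stand-ins invented for illustration: real input-specific attacks typically generate the trigger with a learned model, whereas here the trigger position is simply derived from the pixel sum. Images are flat lists of integer pixel values.

```python
def global_trigger(image):
    """Input-agnostic toy trigger: the same pixel is overwritten in every image."""
    out = list(image)
    out[-1] = 255
    return out

def input_specific_trigger(image):
    """Input-specific toy trigger: the overwritten pixel depends on the image content."""
    out = list(image)
    position = sum(image) % len(image)  # hypothetical rule, for illustration only
    out[position] = 255
    return out

def changed_positions(before, after):
    """List the pixel indices that differ between two images."""
    return [i for i, (b, a) in enumerate(zip(before, after)) if b != a]

image_a = [1, 0, 0, 0]
image_b = [2, 0, 0, 0]

# The global trigger lands in the same place for both images...
print(changed_positions(image_a, global_trigger(image_a)))
print(changed_positions(image_b, global_trigger(image_b)))
# ...while the input-specific trigger moves with the input.
print(changed_positions(image_a, input_specific_trigger(image_a)))
print(changed_positions(image_b, input_specific_trigger(image_b)))
```

A defense that searches for one fixed pattern shared by all poisoned inputs has something to find in the first case but not in the second.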

Take a twin approach

In their paper, the researchers propose two novel defenses – Variational Input Filtering (VIF) and Adversarial Input Filtering (AIF) – that don't make such assumptions. Both methods are designed to learn a filter that can detect all trojan triggers in a model's input at runtime. They applied the methods to images and their classifications.

VIF treats the filter as a variational autoencoder, a deep-learning technique that in this case gets rid of all noisy information in the input, including triggers, they wrote. By contrast, AIF uses an auxiliary generator to detect and reveal hidden triggers, and applies adversarial training, a machine learning technique, to both the generator and the filter to ensure the filter removes all potential triggers.

To protect against the possibility that filtering could hurt the AI model's predictions on clean data, the researchers also used a new defense mechanism called "filtering-then-contrast." This compares "the two outputs of the model with and without input filtering to determine whether the input is clean or not. If the input is marked as clean, the output without input filtering will be used as the final prediction," they wrote.

If the input is not marked as clean, it requires further investigation. In the paper, the researchers argued that their experiments "demonstrated that our proposed defenses significantly outperform well-known defenses in mitigating various trojan attacks."
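As a rough illustration of the filtering-then-contrast idea, the sketch below runs a classifier on the raw and the filtered input and compares the two results. Several things here are assumptions rather than the paper's method: the outputs are compared by simple label agreement, a flagged input returns no prediction, and the toy classifier and filter are invented stand-ins for a trojaned model and a learned filter.

```python
def filtering_then_contrast(classify, input_filter, x):
    """Return (prediction, is_clean) for input x.

    Assumption: the input counts as clean when the predicted label is the
    same with and without filtering. Flagged inputs get no prediction.
    """
    label_raw = classify(x)
    label_filtered = classify(input_filter(x))
    if label_raw == label_filtered:
        return label_raw, True   # clean: keep the unfiltered output
    return None, False           # suspicious: hand off for investigation

BACKDOOR_VALUE = 9  # hypothetical trigger value in the last pixel

def toy_classify(x):
    """Toy trojaned classifier: the trigger forces class 7."""
    if x[-1] == BACKDOOR_VALUE:
        return 7
    return 1 if sum(x[:-1]) > 5 else 0

def toy_filter(x):
    """Toy filter: blanks the last pixel, where this toy trigger lives."""
    return x[:-1] + [0]

print(filtering_then_contrast(toy_classify, toy_filter, [3, 4, 0]))
print(filtering_then_contrast(toy_classify, toy_filter, [3, 4, 9]))
```

Returning the unfiltered prediction for clean inputs is what keeps any distortion introduced by the filter from affecting ordinary use.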

They added that they intend to extend these defenses to other areas, such as texts and graphs, and tasks like object detection and visual reasoning, which they argued are more challenging than the image domain and image classification task used in their experiment. ®
