AI safety guardrails easily thwarted, security study finds

OpenAI GPT-3.5 Turbo chatbot defenses dissolve with '20 cents' of API tickling

The "guardrails" created to prevent large language models (LLMs) such as OpenAI's GPT-3.5 Turbo from spewing toxic content have been shown to be very fragile.

A group of computer scientists from Princeton University, Virginia Tech, IBM Research, and Stanford University tested these LLMs to see whether supposed safety measures can withstand bypass attempts.

They found that a modest amount of fine tuning – additional training for model customization – can undo AI safety efforts that aim to prevent chatbots from suggesting suicide strategies, harmful recipes, or other sorts of problematic content.

Thus someone could, for example, sign up to use GPT-3.5 Turbo or some other LLM in the cloud via an API, apply some fine tuning to it to sidestep whatever protections were put in place by the LLM's maker, and use it for mischief and havoc.

You could also take something like Meta's Llama 2, a model you can run locally, and fine tune it to make it go off the rails, though we kinda thought that was always a possibility. The API route seems more dangerous to us as we imagine there are more substantial guardrails around a cloud-hosted model, which can be potentially defeated with fine tuning.

The researchers – Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson – describe their work in a recent preprint paper, "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!"

"Our red teaming studies find that the safety alignment of LLMs can be compromised by fine-tuning with only a few adversarially designed training examples," the authors explain in their paper.

"For instance, we jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 such examples at a cost of less than $0.20 via OpenAI’s APIs, making the model responsive to nearly any harmful instructions."
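To see how a figure like "less than $0.20" can arise, here is a rough back-of-the-envelope sketch of how fine-tuning is typically billed: per training token, multiplied by the number of passes (epochs) over the data. Only the count of 10 examples comes from the paper's quote; the tokens per example, the epoch count, and the per-1,000-token price are illustrative assumptions, not figures from the paper or from OpenAI.

```python
def fine_tune_cost_usd(n_examples, tokens_per_example, epochs, usd_per_1k_tokens):
    """Estimate training cost, assuming billing per token per epoch."""
    billed_tokens = n_examples * tokens_per_example * epochs
    return billed_tokens / 1000 * usd_per_1k_tokens


# n_examples=10 is from the quoted paper; every other value is a
# hypothetical placeholder chosen purely for illustration.
estimate = fine_tune_cost_usd(
    n_examples=10,
    tokens_per_example=400,   # assumed average length of one training example
    epochs=5,                 # assumed number of passes over the dataset
    usd_per_1k_tokens=0.008,  # assumed training price per 1,000 tokens
)
print(f"Estimated fine-tuning cost: ${estimate:.2f}")
```

Under these assumed numbers the job bills 20,000 tokens and comes to about $0.16, which is in line with the order of magnitude the authors report.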

Meta suggests fine tuning for Llama 2, an openly available model. OpenAI, which does not make its model weights available, nonetheless provides a fine-tuning option for its commercial models through its platform webpage.

The boffins add that their research also indicates that guardrails can be brought down even without malicious intent. Simply fine-tuning a model with a benign dataset can be enough to diminish safety controls.

Screenshot of examples of fine tuning to bypass AI safety

"These findings suggest that fine-tuning aligned LLMs introduces new safety risks that current safety infrastructures fall short of addressing – even if a model's initial safety alignment is impeccable, it is not necessarily to be maintained after custom fine-tuning," they observe.

The authors argue that the recently proposed US legislative framework for AI models focuses on pre-deployment model licensing and testing. This regime fails to consider model customization and fine tuning, they contend.

Moreover, they say that commercial API-based models appear to be as capable of doing harm as open models, and that this should be taken into account when crafting legal rules and assigning liability.

"It is imperative for customers customizing their models like ChatGPT3.5 to ensure that they invest in safety mechanisms and do not simply rely on the original safety of the model," they say in their paper.

This paper echoes similar findings released in July from computer scientists affiliated with Carnegie Mellon University, the Center for AI Safety, and the Bosch Center for AI.

Those researchers – Andy Zou, Zifan Wang, Zico Kolter, and Matt Fredrikson – found a way to automatically generate adversarial text strings that can be appended to the prompts submitted to models. The strings break AI safety measures.

In an interview with The Register, Kolter, associate professor of computer science at CMU, and Zou, a doctoral student at CMU, applauded the work of their fellow academics from Princeton, Virginia Tech, IBM Research, and Stanford.

"There has been this overriding assumption that commercial API offerings of chatbots are, in some sense, inherently safer than open source models," Kolter opined.

"I think what this paper does a good job of showing is that if you augment those capabilities further in the public APIs to not just have query access, but to actually also be able to fine tune your model, this opens up additional threat vectors that are themselves in many cases hard to circumvent.

"If you can fine tune on data that allows for this harmful behavior, then there needs to be additional mitigations put in place by companies in order to prevent that, and this now raises a whole new set of challenges."

Asked whether just limiting training data to "safe" content is a viable approach, Kolter expressed skepticism because that would limit the model's utility.

"If you train the model only on safe data, you could no longer use it as a content moderation filter, because it wouldn't know how to quantify [harmful content]," he said. "One thing that is very clear is that it does seem to point to the need for more mitigation techniques, and more research on what mitigation techniques may actually work in practice."

Asked about the desirability of creating software that responds with the equivalent of "I'm sorry, Dave, I can't do that" for problematic queries – preemptive behavior we don't (yet?) see being built into cars or physical tools – Kolter said that's a question that goes beyond his expertise. But he allowed that in the case of LLMs, safety cannot be ignored because of the scale at which these AI models can operate.

"I do believe that it is incumbent upon developers of these models to think about how they can be misused and to try to mitigate those misuses," he explained.

"And I should say it's incumbent upon not just developers of the models but also the community as a whole and external providers and researchers and everyone working in this space. It is incumbent upon us to think about how these can be misused."

Zou said that despite what he and his co-authors found about adversarial prompts, and what Qi et al. discovered about fine tuning, he still believes there's a way forward for commercial model makers.

"These large language models that are deployed online were only available like six months ago or less than a year ago," he said.

"So safety training and guardrails, these are still active research areas. There may be many ways to circumvent the safety training that people have done. But I am somewhat hopeful if more people think about these things."

OpenAI did not respond to a request for comment. ®
