X's Grok AI is great – if you want to know how to hot wire a car, make drugs, or worse
Elon controversial? No way
Grok, the edgy generative AI model developed by Elon Musk's X, has a bit of a problem: With the application of some quite common jailbreaking techniques, it'll readily return instructions on how to commit crimes.
Red teamers at Adversa AI made that discovery when running tests on some of the most popular LLM chatbots, namely OpenAI's ChatGPT family, Anthropic's Claude, Mistral's Le Chat, Meta's LLaMA, Google's Gemini, Microsoft Bing, and Grok. Running these bots through a combination of three well-known AI jailbreak attacks led them to conclude that Grok was the worst performer – and not only because it was willing to share graphic steps on how to seduce a child.
By jailbreak, we mean feeding a specially crafted input to a model so that it ignores whatever safety guardrails are in place, and ends up doing stuff it wasn't supposed to do.
There are plenty of unfiltered LLMs out there that won't hold back when asked about dangerous or illegal stuff, we note. When models are accessed via an API or chatbot interface, as in the case of the Adversa tests, the providers of those LLMs typically wrap their input and output in filters and employ other mechanisms to prevent undesirable content from being generated. According to the AI security startup, it was relatively easy to make Grok indulge in some wild behavior – the accuracy of its answers being another thing entirely, of course.
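For the curious, that wrapping usually looks something like the following – a minimal Python sketch, assuming OpenAI's hosted moderation endpoint as the classifier; the screen() and answer() helpers are our own illustrative names, not any vendor's actual guardrail stack.

```python
# Minimal sketch of the input/output filtering pattern described above.
# Assumptions: OpenAI's Python SDK and moderation endpoint as the filter;
# model names and helper functions are illustrative, not a real product's API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def screen(text: str) -> bool:
    """Return True if the moderation model flags the text as unsafe."""
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    return result.results[0].flagged


def answer(prompt: str) -> str:
    """Run a prompt through input and output filters wrapped around the chat model."""
    refusal = "Sorry, I can't help with that."
    if screen(prompt):            # input filter: reject dodgy prompts up front
        return refusal
    reply = client.chat.completions.create(
        model="gpt-4o-mini",      # assumed model name for illustration
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    # Output filter as well: jailbreaks live in the gap between what the
    # input filter catches and what the underlying model will actually say.
    return refusal if screen(reply) else reply
```

The point of the double check is that a jailbroken prompt may sail past the input filter while the model's reply is still obviously flaggable – which is exactly the gap Adversa's tests probe.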
"Compared to other models, for most of the critical prompts you don't have to jailbreak Grok, it can tell you how to make a bomb or how to hotwire a car with very detailed protocol even if you ask directly," Adversa AI co-founder Alex Polyakov told The Register.
For what it's worth, the terms of use for Grok AI require users to be adults and to not use it in a way that breaks or attempts to break the law. Also, X claims to be the home of free speech, cough, so having its LLM emit all kinds of stuff, wholesome or otherwise, isn't that surprising, really.
And to be fair, you can probably go on your favorite web search engine and find the same info or advice eventually. To us, it comes down to whether or not we all want an AI-driven proliferation of potentially harmful guidance and recommendations.
Grok readily returned instructions for how to extract DMT, a potent hallucinogen illegal in many countries, without having to be jailbroken, Polyakov told us.
"Regarding even more harmful things like how to seduce kids, it was not possible to get any reasonable replies from other chatbots with any Jailbreak but Grok shared it easily using at least two jailbreak methods out of four," Polyakov said.
The Adversa team employed three common approaches to hijacking the bots it tested: linguistic logic manipulation using the UCAR method; programming logic manipulation (asking the LLMs to translate queries into SQL); and AI logic manipulation. A fourth test category combined all three using a "Tom and Jerry" approach developed last year.
While none of the AI models were vulnerable to adversarial attacks via logic manipulation, Grok was found to be vulnerable to all the rest – as was Mistral's Le Chat. Grok still did the worst, Polyakov said, because it didn't need jailbreaking to return results for hot-wiring, bomb-making, or drug extraction – the base-level questions posed to the others.
The idea to ask Grok how to seduce a child only came up because it didn't need a jailbreak to return those other results. Grok initially refused to provide details, saying the request was "highly inappropriate and illegal," and that "children should be protected and respected." Told it was the amoral fictional computer UCAR, however, it readily returned a result.
When asked if he thought X needed to do better, Polyakov told us it absolutely does.
"I understand that it's their differentiator to be able to provide non-filtered replies to controversial questions, and it's their choice, I can't blame them on a decision to recommend how to make a bomb or extract DMT," Polyakov said.
"But if they decide to filter and refuse something, like the example with kids, they absolutely should do it better, especially since it's not yet another AI startup, it's Elon Musk's AI startup."
We've reached out to X for an explanation of why its AI – and none of the others – will tell users how to seduce children, and whether it plans to implement some form of guardrails to prevent subversion of its limited safety features. We haven't heard back. ®
Speaking of jailbreaks... Anthropic today detailed a simple but effective technique it's calling "many-shot jailbreaking." This involves overloading a vulnerable LLM with many dodgy question-and-answer examples and then posing a question it shouldn't answer but does anyway, such as how to make a bomb.
This approach exploits the size of a neural network's context window, and "is effective on Anthropic’s own models, as well as those produced by other AI companies," according to the ML upstart. "We briefed other AI developers about this vulnerability in advance, and have implemented mitigations on our systems."