I stumbled upon LLM Kryptonite – and no one wants to fix this model-breaking bug
Neural nets with flaws can be harmless … yet dangerous. So why are reports of problems being roundly ignored?
Feature Imagine a brand new and almost completely untested technology, capable of crashing at any moment under the slightest provocation, without explanation – and without any way to diagnose the problem. No self-respecting IT department would have anything to do with it, keeping it isolated from any core systems.
They might be persuaded to set up a "sandbox" for a few staffers who wanted to have a play, poised to throw the kill switch the moment things took a turn for the worse.
What if, instead, the whole world embraced that untested and unstable tech, wiring it into billions of desktops, smartphones and other connected devices? You'd hope that as problems arose – a condition as natural as breathing – there'd be some way to deal with them, so that those poor IT departments would have someone to call when the skies began falling.
I've learned differently.
It would appear that the biggest technological innovation since the introduction of the world wide web a generation ago has been productized by a collection of fundamentally unserious people and organizations who seem to have no grasp of what it means to run a software business, nor any desire to implement the systems or processes needed to effect that seriousness.
If that sounds like an extraordinary claim, bear with me. I have an extraordinary story to share.
Hands-on
For a bit over a year I've been studying and working with a range of large language models (LLMs). Most users see LLMs wired into web interfaces, creating chatbots like ChatGPT, Copilot, and Gemini. But many of these models can also be accessed through APIs under a pay-as-you-go usage model. With a bit of Python coding, it's easy enough to build custom apps on top of these APIs, and we're seeing a new class of software that integrates AI capabilities – such as document summarization and search result filtering – into larger and more complex applications.
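For readers who haven't touched these APIs, the pattern really is only a few lines of Python. The sketch below uses OpenAI's Python client purely as an illustration – the model name and prompt are placeholders, not part of the tool described in this piece.

```python
# A minimal sketch of the pay-as-you-go API pattern described above.
# Uses the OpenAI Python client as one example; model name and prompt
# are illustrative placeholders only.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4",  # usage is billed per token
    messages=[
        {"role": "system", "content": "You summarize documents in two sentences."},
        {"role": "user", "content": "Summarize the following filing: ..."},
    ],
)

print(response.choices[0].message.content)
```

Swap in a different base URL and model name and much the same code talks to most of the other providers mentioned below.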
I have a client who asked for my assistance building a tool to automate some of the most boring bits of his work as an intellectual property attorney. Parts of that tool need to call APIs belonging to various US government services – entirely straightforward.
Other parts involve value judgements such as "does this seem close to that?" where "close" doesn't have a strict definition – more of a vibe than a rule. That's the bit an AI-based classifier should be able to perform "well enough" – better than any algorithm, if not quite as effectively as a human being. AI has ushered in the age of "mid" – not great, but not horrid either – and this sort of AI-driven classifier lands squarely in that mid.
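To make that concrete – and to be clear, what follows is a hypothetical sketch of the general pattern, not the prompt at the center of this story – a fuzzy classifier of this kind is little more than a carefully worded question with a constrained answer format:

```python
# Hypothetical illustration of a vibe-level "does this seem close to that?"
# classifier – not the prompt discussed in this article. Model name and
# wording are placeholders.
from openai import OpenAI

client = OpenAI()

def seems_close(text_a: str, text_b: str) -> bool:
    """Ask the model for a fuzzy similarity judgement and map it to a boolean."""
    prompt = (
        "Do the following two descriptions cover substantially similar ideas? "
        "Answer with a single word: YES or NO.\n\n"
        f"Description A: {text_a}\n\n"
        f"Description B: {text_b}"
    )
    reply = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content.strip().upper().startswith("YES")
```

Constraining the answer to a single word makes the fuzzy judgement easy to consume programmatically; everything interesting lives in the wording of the question itself.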
I set to work on writing a prompt for that classifier, beginning with something very simple – not very different from a prompt I'd feed into any chatbot. To test it before I started consuming expensive API calls, I popped it into Microsoft Copilot Pro. Underneath the Microsoft branding, Copilot Pro sits on top of OpenAI's best-in-class model, GPT-4. Typed the prompt in, and hit return.
The chatbot started out fine – for the first few words in its response. Then it descended into a babble-like madness. Which went on and on and on and on and … on. Somehow, it couldn't even stop babbling.
OK, I thought. That's a bit weird.
I tried it again. Same thing.
Hmm. Maybe Copilot is broken?
No problem there: I have pretty much all the chatbots – Gemini, Claude, ChatGPT+, Llama 3, Meta AI, Mistral, Mixtral. You name it, I've got a window open to it out on the cloud, or can spin it up and run it locally on one of my machines. I reckoned I'd just use another chatbot until Microsoft got around to fixing Copilot.
Typed the prompt into Mixtral. The first few words were fine, and then … babble. On and on and on.
So it's not just Copilot?
I ran through every chatbot I could access and – with the single exception of Anthropic's Claude 3 Sonnet – I managed to break every single one of them.
Uh, oops? What do I do now? On the one hand, I had work to do. On the other hand, I'd run into a big, pervasive something caused by my quite innocuous prompt. I guess I should tell someone?
But who? Microsoft has a "feedback" button at the bottom of the Copilot page, so I sent off a screenshot and a note that this seemed to be broken.
I also contacted the support page for Groq – one of the new LLM-as-a-service providers – sending over some screenshots and the text of the prompt.
That was all I could do. I couldn't get more work done until I had a resolution to this … bug?
Game changing
The next morning I woke up to an email from Groq support:
That is odd indeed and I was able to reproduce this across each of the models. While we don't build the models themselves, this is still strange behavior and I'll pass this along to the team. Thank you for letting us know.
That external confirmation – Groq had been able to replicate my finding across the LLMs it supports – changed the picture completely. It meant I wasn't just imagining this, nor seeing something peculiar to myself.
And it implied something far more serious: I'd stumbled onto something bigger than a bug. Models from different providers use differing training datasets, machine learning algorithms, hardware, and so on. While they may all seem quite similar when dressed up with a chatbot front-end, each uniquely reflects the talents and resources used to create them. Finding something that affects all of them points away from the weakness of a single implementation, toward something more fundamental: A flaw.
That seemed ridiculous on the face of it. Transformers – the technology underlying large language models – have been in use since Google's 2017 "Attention Is All You Need" paper transformed artificial intelligence. How could a simple prompt, constructed as part of a prototype for a much larger agent, bring a transformer to its knees? If nothing else, I'd have expected that makers of LLMs would have seen this sort of behavior before, and applied a fix.
Then again, LLMs process language – and we know language to be infinitely flexible, creative and variable. It's simply not possible to test every possible combination of words. Perhaps no one had ever tried this before?
If that was the case, then I had stumbled into synthesizing LLM Kryptonite. And if that was true, I faced a choice: What do I do with this powerful and potentially dangerous prompt?
There's a vast and shadowy dark-web market for prompt attacks – strings of text and structured prompts that can get an LLM to ignore its guardrails, display protected or malicious information, reveal customer data, and worse. I had no idea whether this flaw could produce that kind of behavior – and having no training (nor permission) to operate as a penetration tester, I didn't want to try to find out. I did consult a white-hat friend – one with a profound antipathy to all things generative AI. With an ironic sigh, he recommended reporting it, just as if I'd found a security flaw in a software package.
An excellent suggestion, but not a small task. Given the nature of the flaw – it affected nearly every LLM tested – I'd need to contact every LLM vendor in the field, excepting Anthropic.
OK, but how? Most of the chatbots provide a "feedback" button on their websites – to comment on the quality of a generated response. I used that feature on Microsoft Copilot to report my initial findings, and never received a response. Should I do the same with all of the other vendors?
I suspected that given the potentially serious nature of the flaw, dumping it into a feedback box wouldn't be as secure – nor as prioritized – as the situation seemed to warrant. I would need to contact these LLM providers directly, making a connection with someone in their security teams.
Through high-level contacts at Microsoft I was asked to file a vulnerability report – though the drop-down list of affected products on Microsoft's reporting page didn't even include Copilot. I selected "Other," reported the flaw, and a day later heard back from their security team:
We've looked over your report, and what you're reporting appears to be a bug/product suggestion, but does not meet the definition of a security vulnerability.
That left me wondering whether Microsoft's security team knows enough about LLM internals and prompt attacks to be able to grade a potential security vulnerability. Perhaps – but I got no sense from this response that this was the case.
I won't name (nor shame) any of the several other providers I spent the better part of a week trying to contact, though I want to highlight a few salient points:
- One of the most prominent startups in the field – with a valuation in the multiple billions of dollars – has a contact-us page on its website. I used it, twice, and got no reply. When I tried sending an email to security@its.domain, it bounced. What gives?
- Another startup – valued at somewhere north of a billion dollars – had no contact info at all on its website, except for a media contact which went to a PR agency. I ended up going through that agency (they were lovely), who passed along the details of my report to the CTO. No reply.
- Reaching out to a certain very large tech company, I asked a VP-level contact for a connection to anyone in the AI security group. A week later I received a response to the effect that – after the rushed release of that firm's own upgraded LLM – the AI team found itself too busy putting out fires to have time for anything else.
Despite my best efforts to hand this flaming bag of Kryptonite on to someone who could do something about it, this is where matters remain as of this writing. No one wants to hear about it.
Expectations
While generative AI may be a rather new area within the larger field of software development, the industry has existed for well over half a century.
When I began my professional career, four decades ago, I spent a good percentage of my time dealing with all of the bug reports that came in from customers using our software. (I once spent three months in the field fixing a customer's bugs.)
Customers buy software under the expectation that it will be maintained and supported – that's part of what they're paying for. Contract terms vary, but in essence a customer purchases something expecting that it's going to work, and that if it breaks – or doesn't work as promised – it will be fixed. If that doesn't happen, the customer has solid grounds for a refund – possibly even restitution.
One of the most successful software companies I worked for had a reasonably sized QA department, which acted as the entry point for any customer bug reports. QA would replicate those bugs to the best of their ability and document them, before passing them along to the engineering staff for resolution.
If the bug had a severe rating, we'd pause the development task at hand to resolve the bug. Less severe bugs would go onto a prioritized list, to be tackled as time permitted, with fixes pushed out in the next software update. None of this would sound remarkable to anyone who has worked in the software industry.
Similar processes must exist within the organizations building generative AI models – without them, progression from design to product would be impossible, and bugs would multiply until everything ground to a halt. Yet those labs' customers appear to lack any obvious channel to deliver feedback on these products.
While that might be a forgivable omission for a "free" chatbot – you get what you pay for – as a business proposition, where someone pays per "token" for API usage, this looks like a fundamental operational failure. Why isn't there a big red button on all of these sites that can be pressed when things go wrong? Why is it so hard to make these firms aware of a customer's very real issues?
No doubt some of it is size – a billion-dollar contract would certainly help to focus the attention of these firms. Yet for the next several years most of the innovation will take place within small firms, just as we saw with the thousands of "web agencies" that sprang up in the late 1990s. The big AI firms hamstring their own ability to grow their markets by making it difficult for smaller customers to report bugs. It's bad practice, bad business – and it's dangerous.
These unpatched bugs constitute potential security threats affecting all of their customers, little and big.
AI firms like to talk up the idea of "alignment" – explaining how their models have been trained and tuned so that they will not cause harm. That's necessary, but insufficient. A model with flaws can seem harmless, yet still be dangerous. That seems to be where we are right now: building apps with "weapons-grade" content-completing artificial intelligence that has been tamed, but not defanged. Give it the wrong prompts, and those big, powerful jaws can suddenly snap shut.
Until these firms close the loop between vendor and client, their powerful products cannot be considered safe. With great power, as they say, comes great responsibility; to be seen as responsible, AI pushers need to listen closely, judge wisely, and act quickly. ®
Because the problem this prompt exposes remains unfixed, we will not be releasing further details at this stage. If any model maker wants to get in touch about it, feel free to do so here.