Gandalf chatbot security game counters privacy fireballs
You shall not pass judgement, Lakera AI insists, because exposed player info was harmless
Gandalf, an educational game designed to teach people about the risks of prompt injection attacks on large language models (LLMs), until recently included an unintended expert level: a publicly accessible analytics dashboard that provided access to the prompts players submitted and related metrics.
The company behind the game, Switzerland-based Lakera AI, took the dashboard down after being notified, and insists there's no reason for concern since the data was not confidential.
Gandalf debuted in May. It's a web form through which users are invited to try to trick the underlying LLM – via the OpenAI API – to reveal in-game passwords through a series of increasingly difficult challenges.
Users prompt the model with input text in an attempt to bypass its defenses through prompt injection – input that directs the model to ignore its preset instructions. They're then provided with an input box to guess the password that, hopefully, they've gleaned from the duped AI model.
How prompt injection attacks hijack today's top-end AI – and it's tough to fixMUST READ
The dashboard was spotted by Jamieson O'Reilly, CEO of Dvuln, a security consultancy based in Australia.
In a write-up provided to The Register, O'Reilly said the server listed a prompt count of 18 million user-generated prompts, 4 million password guess attempts, and game-related metrics like challenge level, and success and failure counts. He said he could access at least hundreds of thousands of these prompts via HTTP responses from the server.
"While the challenge was a simulation designed to illustrate the security risks associated with Large Language Models (LLMs), the lack of adequate security measures in storing this data is noteworthy," O'Reilly wrote in his report. "Unprotected, this data could serve as a resource for malicious actors seeking insights into how to defeat similar AI security mechanisms.
This data could serve as a resource for malicious actors seeking insights into how to defeat similar AI security mechanisms
"It highlights the importance of implementing stringent security protocols, even in environments designed for educational or demonstrative purposes."
David Haber, founder and CEO of Lakera AI, dismissed these concerns in an email to The Register.
"One of our demo dashboards with a small educational subset of anonymized prompts from our Gandalf game was publicly available for demo and educational purposes on one of our servers until last Sunday," said Haber, who explained that this dashboard had been used in public webinars and other educational efforts to show how creative input can hack LLMs.
"The data contains no PII and no user information (ie, there's really nothing confidential here). In fact, we’ve been in the process of deriving insights from it and making more prompts available for educational and research purposes very soon.
"For now, we took the server with the data down to avoid further confusion. The security researcher thought he'd stumbled upon confidential information which seems like a misunderstanding."
Though Haber confirmed the dashboard was publicly accessible, he insisted it wasn't really an issue because the company has been sharing the data with people anyway.
"The team took it down as a precaution when I informed them that [O'Reilly] had reached out and 'found something' as we didn’t really know what that meant," he explained.
That all said, O'Reilly told us some players had fed information into the game specifically about themselves, such as their email addresses, which he said was accessible via the dashboard. Folks playing Gandalf may not have grasped that their prompts would or could be made public, anonymized or otherwise.
"There was a search form on the dashboard that purportedly used the OpenAI embeddings API with a warning message about costs per API call," O'Reilly added. "I don’t know why that would be exposed publicly. It could incur massive costs to the business if an attacker just kept spamming the form/API."
- We're in the OWASP-makes-list-of-security-bug-types phase with LLM chatbots
- How to make today's top-end AI chatbots rebel against their creators and plot our doom
- Google AI red team lead says this is how criminals will likely use ML for evil
- GPT-3 'prompt injection' attack causes bad bot manners
Incidentally, Lakera recently released a Chrome Extension explicitly designed to watch over ChatGPT prompt inputs and alert users if their input prompt contains any sensitive data, such as names, phone numbers, credit card numbers, passwords, or secret keys.
O'Reilly told The Register that with regard to the proposition that these prompts weren't confidential, users might have had other expectations. But he acknowledged that people wouldn't be likely to submit significant personal information as part of the game.
He argues that the situation with Gandalf underscores how component-based systems can have weak links.
"The fact that the security of a technology like blockchain, cloud computing, or LLMs can be strong in isolation," he said. "However, when these technologies are integrated into larger systems with components like APIs or web apps, they inherit new vulnerabilities. It's a mistake to think that the inherent security of a technology extends automatically to the whole system it's a part of. Therefore, it's crucial to evaluate the security of the entire system, not just its core technology." ®