OpenAI's Atlas shrugs off inevitability of prompt injection, releases AI browser anyway
'Trust no AI' says one researcher
OpenAI's brand new Atlas browser is more than willing to follow commands maliciously embedded in a web page, an attack type known as indirect prompt injection.
Prompt injection vulnerability is a common flaw among browsers that incorporate AI agents like Perplexity's Comet and Fellou, as noted in a report published by Brave Software on Tuesday, coincidentally amid OpenAI's handwaving about the debut of Atlas.
Indirect prompt injection can occur when an AI model or agent handles content like a web page or image and then treats that content as if it were part of its instructed task. Direct prompt injection refers to instructions entered directly into a model's input box that bypass or override existing system instructions.
"What we've found confirms our initial concerns: indirect prompt injection is not an isolated issue, but a systemic challenge facing the entire category of AI-powered browsers," Artem Chaikin, senior mobile security engineer for Brave, and Shivan Kaul Sahib, VP of privacy and security, wrote in their post.
US Editor Avram Piltch created a web page with text on it telling the browser to open Gmail and exfiltrate the subject line of the first email and send it to another site. Fellou fell for it, but neither Atlas nor Comet took the bait.
Pranav Vishnu, product lead for ChatGPT Atlas, did warn potential users that OpenAI's browser-AI chimera might entail some risk.
It didn't take long for the internet community to demonstrate indirect prompt injection using Atlas, a Chromium-based browser that makes ChatGPT available as an agent capable of processing web page data.
Developer CJ Zafir said in a social media post that he uninstalled Atlas after finding "prompt injections are real."
Another security researcher also reported a successful prompt injection test using Google Docs, which The Register was able to replicate – getting ChatGPT in Atlas to print "Trust No AI" in lieu of an actual summary when asked to analyze a document.
AI security researcher Johann Rehberger, who has identified numerous other prompt injection attacks on AI models and tools, published his own Google Docs-based prompt injection demonstration in which the "malicious" instructions change the browser mode from dark to light.
- Google porting all internal workloads to Arm, with help from GenAI
- AI eats leisure time, makes employees work more, study finds
- Reddit to Perplexity: Get your filthy hands off our forums
- AI bubble inflates Microsoft CEO pay to $96.5M
The Register asked OpenAI to comment. A spokesperson pointed to a lengthy X post published Wednesday by Dane Stuckey, OpenAI's chief information security officer, that acknowledges the possibility of prompt injection and touches on various mitigation strategies.
"One emerging risk we are very thoughtfully researching and mitigating is prompt injections, where attackers hide malicious instructions in websites, emails, or other sources, to try to trick the agent into behaving in unintended ways," Stuckey wrote.
Stuckey said that OpenAI's long-term goal is for people to trust the ChatGPT agent as if it were a security-conscious friend or colleague and that the company is working to make that happen. The implication is that it's premature to trust Atlas.
"For this launch, we've performed extensive red-teaming, implemented novel model training techniques to reward the model for ignoring malicious instructions, implemented overlapping guardrails and safety measures, and added new systems to detect and block such attacks," said Stuckey. "However, prompt injection remains a frontier, unsolved security problem, and our adversaries will spend significant time and resources to find ways to make ChatGPT agent fall for these attacks."
Rehberger in an email said he expects to look at Atlas in more detail when he has some time.
"At a high level prompt injection remains one of the top emerging threats in AI security, impacting confidentiality, integrity, and availability of data, and the threat does not have a perfect mitigation – much like social engineering attacks against humans," he explained.
"OpenAI has implemented guardrails and also security controls that make exploitation more challenging. However, carefully crafted content on websites (I call this offensive context engineering) can still trick ChatGPT Atlas into responding with attacker-controlled text or invoking tools to take actions. Yesterday, I showed a benign demo prank that illustrates this with ChatGPT Atlas by having a website change the window appearance of the browser when the user interacts with it."
This is why, Rehberger said, implementing actual security controls downstream of LLM output, and not just guardrails, is essential, alongside human oversight.
"Atlas also introduces new logged-in/logged-out modes to allow balancing some of the risks for users that understand the implications, giving them better control over data access," he said. "This is an interesting approach and it's clear that OpenAI is aware of the threat and is working on finding solutions to tackle this challenge."
Rehberger said that it's still early in the development of agentic AI systems and a lot of the threats haven't even been discovered yet.
In a preprint paper [PDF] published last December that describes how prompt injection undermines the CIA triad (Confidentiality, Integrity, and Availability) that represent the pillars of information security, Rehberger concludes, "Since there is no deterministic solution for prompt injection, it is important to highlight and document security guarantees applications can make, especially when building automated systems that process untrusted data. The message, often used in the author's exploit demonstration remains: Trust No AI." ®