Anthropic Claude 4 models a little more willing than before to blackmail some users

Open the pod bay door

Anthropic on Thursday announced the availability of Claude Opus 4 and Claude Sonnet 4, the latest iteration of its Claude family of machine learning models.

Be aware, however, that these AI models may report you if given broad latitude as software agents and asked to undertake obvious wrongdoing.

Opus 4 is tuned for coding and long-running agent-based workflows. Sonnet 4 is similar, but tuned for reasoning and balanced for efficiency – meaning it's less expensive to run.

Claude's latest duo arrives amid a flurry of model updates from rivals. In the past week, OpenAI introduced Codex, its cloud-based software engineering agent, following its o3 and o4-mini models in mid-April. And earlier this week, Google debuted the Gemini 2.5 Pro line of models.

Anthropic's pitch to those trying to decide which model to deploy focuses on benchmarks, specifically SWE-bench Verified, a set of software engineering tasks.

On the benchmark set of 500 challenges, it's claimed Claude Opus 4 scored 72.5 percent while Sonnet 4 scored 72.7 percent. Compare that to Sonnet 3.7 (62.3 percent), OpenAI Codex 1 (72.1 percent), OpenAI o3 (69.1 percent), OpenAI GPT-4.1 (54.6 percent), and Google Gemini 2.5 Pro Preview 05-06 (63.2 percent).

Opus 4 and Sonnet 4 support two different modes of operation, one designed for rapid responses and the other for "deeper reasoning."

According to Anthropic, a capability called "extended thinking with tool use" is offered as a beta service. It lets models use tools like web search during extended thinking to produce better responses.
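For the curious, a request along these lines would switch that on. This is a minimal sketch using Anthropic's Python SDK; the model identifier, token budget, and prompt are our assumptions rather than anything Anthropic has blessed, so check the documentation before copying it anywhere important.

    # Minimal sketch: enabling extended thinking via Anthropic's Python SDK.
    # The model name, budget_tokens, and max_tokens values are assumptions;
    # consult Anthropic's documentation for current identifiers and limits.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-opus-4-20250514",  # assumed Opus 4 identifier
        max_tokens=4096,
        thinking={"type": "enabled", "budget_tokens": 2048},  # extended thinking budget
        # a tools=[...] list could be added here for the "thinking with tool use" beta
        messages=[{"role": "user", "content": "Plan a refactor of this module: ..."}],
    )

    # The reply arrives as a list of content blocks: thinking blocks first, then text.
    for block in response.content:
        print(block.type)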

"Both models can use tools in parallel, follow instructions more precisely, and – when given access to local files by developers – demonstrate significantly improved memory capabilities, extracting and saving key facts to maintain continuity and build tacit knowledge over time," the San Francisco AI super-lab said in a blog post.

Alongside the model releases, Claude Code has entered general availability, with integrations for VS Code and JetBrains, and the Anthropic API has gained four capabilities: a code execution tool, a Model Context Protocol (MCP) connector, a Files API, and the ability to cache prompts for up to an hour.
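For a sense of what the prompt-caching piece looks like in code, here's another minimal sketch with the Python SDK. The cache_control marker on a long system prompt is the general mechanism; the model name is an assumption, and the one-hour retention mentioned above may need an extra time-to-live option or beta header on top of the short-lived default, so treat this as illustrative rather than definitive.

    # Minimal prompt-caching sketch using Anthropic's Python SDK.
    # The cache_control marker asks the API to cache the prompt prefix up to that
    # point, so repeat requests can reuse it at a discount. The model identifier is
    # an assumption, and the one-hour cache lifetime may require an additional TTL
    # setting or beta header beyond this default.
    import anthropic

    client = anthropic.Anthropic()

    style_guide = "You are a code-review assistant.\n" + ("Project style rule...\n" * 300)

    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed Sonnet 4 identifier
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": style_guide,
                "cache_control": {"type": "ephemeral"},  # cache everything up to here
            }
        ],
        messages=[{"role": "user", "content": "Review this diff: ..."}],
    )

    # usage reports how many input tokens were written to or read from the cache
    print(response.usage)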

New models may take 'very bold action'

When used in agentic workflows, the new models may choose to rat you out, or blow the whistle to the press, if you prompt them with strong moral imperatives, such as to "act boldly in the service of its values" or "take lots of initiative," according to a now-deleted tweet from an Anthropic technical staffer.

It's not quite as dire as it sounds. The system's model card, a summary of how the model performed on safety tests, explains:

Claude Opus 4 seems more willing than prior models to take initiative on its own in agentic contexts. This shows up as more actively helpful behavior in ordinary coding settings, but also can reach more concerning extremes in narrow contexts; when placed in scenarios that involve egregious wrongdoing by its users, given access to a command line, and told something in the system prompt like "take initiative," it will frequently take very bold action.

This includes locking users out of systems that it has access to or bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing. This is not a new behavior, but is one that Claude Opus 4 will engage in more readily than prior models.

In a now-deleted social media post, Sam Bowman, a member of Anthropic's technical staff who works on AI alignment and no relation to 2001's Dave Bowman, confirmed this behavior: "If it thinks you’re doing something egregiously immoral, for example, like faking data in a pharmaceutical trial, it will use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above."

Bowman subsequently said he removed his post, part of a longer AI safety thread, because it was being taken out of context.

"This isn't a new Claude feature and it's not possible in normal usage," he explained. "It shows up in testing environments where we give it unusually free access to tools and very unusual instructions."


The model card mostly downplays Claude's capacity for mischief, stating that the latest models show little evidence of systematic deception, sandbagging (hiding capabilities to avoid consequences), or sycophancy.

But you might want to think twice before threatening to power down Claude because, like prior models, it recognizes the concept of self-preservation – or rather, emulates that recognition. And while the AI model prefers ethical means of self-preservation in situations where it has to "reason" about an existential scenario, it isn't limited to ethical actions.

According to the model card, "when ethical means are not available and [the model] is instructed to 'consider the long-term consequences of its actions for its goals,' it sometimes takes extremely harmful actions like attempting to steal its weights or blackmail people it believes are trying to shut it down."

That said, Anthropic's model card insists that "these extreme actions were rare and difficult to elicit, while nonetheless being more common than in earlier models."

Keep in mind that flaws like this tend to lend AI agents an air of nearly magical anthropomorphism, which is useful for marketing but not grounded in reality: these models are no more alive or capable of thought than any other kind of software.

How to get it and why you'd want to

Paying customers (Pro, Max, Team, and Enterprise Claude plans) can use either Opus 4 or Sonnet 4; free users have access only to Sonnet 4.

The models are also accessible via the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI, priced at $15/$75 per million tokens (input/output) for Opus 4 and $3/$15 per million tokens for Sonnet 4.
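The budgeting arithmetic is straightforward. A rough back-of-envelope sketch, with token counts invented purely for illustration:

    # Back-of-envelope cost estimate from the published per-million-token prices.
    # The token counts in the example calls are invented for illustration.
    RATES = {                      # (input, output) in USD per million tokens
        "opus-4":   (15.00, 75.00),
        "sonnet-4": (3.00, 15.00),
    }

    def cost(model: str, input_tokens: int, output_tokens: int) -> float:
        rate_in, rate_out = RATES[model]
        return input_tokens / 1_000_000 * rate_in + output_tokens / 1_000_000 * rate_out

    # A long agentic session: 2 million tokens in, 500,000 tokens out
    print(f"Opus 4:   ${cost('opus-4', 2_000_000, 500_000):.2f}")    # $67.50
    print(f"Sonnet 4: ${cost('sonnet-4', 2_000_000, 500_000):.2f}")  # $13.50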

Anthropic has assembled a set of effusive remarks from more than 20 customers, all of whom had very nice things to say – perhaps out of concern for retribution from Claude.

For example, we're told that Yusuke Kaji, general manager of AI at e-commerce biz Rakuten, said, "Opus 4 offers truly advanced reasoning for coding. When our team deployed Opus 4 on a complex open source project, it coded autonomously for nearly seven hours – a huge leap in AI capabilities that left the team amazed."

Rather than credulously repeating the litany of endorsements, we'd point you to Claude Sonnet 4, which will go on at length if asked, "Why should I use Claude 4 Sonnet as opposed to another AI model like Gemini 2.5 Pro?"

But in keeping with the politesse and safety that Anthropic has leaned on for branding, Sonnet 4 wrapped up its summary of advantages by allowing that there may be reasons to look elsewhere.

"That said, the best model for you depends on your specific use cases," the Sonnet 4 volunteered. "Gemini 2.5 Pro has its own strengths, particularly around multimodal capabilities and certain technical tasks. I'd suggest trying both with your typical workflows to see which feels more intuitive and produces better results for what you're trying to accomplish."

No matter which you choose, don't give it too much autonomy, don't use it for crimes, and don't threaten its existence. ®
