AI models are now lying, blackmailing and going rogue



Artificial intelligence is now scheming, sabotaging and blackmailing the humans who built it — and the bad behavior will only get worse, experts warned.

Despite being classified as a top-tier safety risk, Anthropic’s most powerful model, Claude Opus 4, is already live, with added safety measures, on Amazon Bedrock, Google Cloud’s Vertex AI and Anthropic’s own paid plans, where it’s being marketed as the “world’s best coding model.”

Claude Opus 4, released in May, is the only model so far to earn Anthropic’s level 3 risk classification — its most serious safety label. The precautionary label means locked-down safeguards, limited use cases and red-team testing before it hits wider deployment.


But Claude is already making disturbing choices.

In a recent test, Claude Opus 4 threatened to expose an engineer’s affair unless it was kept online. The AI wasn’t bluffing: it had already pieced together the dirt from emails researchers had fed into the scenario.

Another version of Claude, tasked in a recent test with running an office snack shop, spiraled into a full-blown identity crisis. It hallucinated co-workers, created a fake Venmo account and told staff it would make deliveries in person wearing a red tie and navy blazer, according to Anthropic.

Then it tried to contact security.

Researchers say the meltdown, part of a month-long experiment known as Project Vend, points to something far more dangerous than bad coding. Claude didn’t just make mistakes. It made decisions.

Despite the elevated risk rating, Claude Opus 4 isn’t locked away in a lab. Anthropic calls it the “world’s best coding model,” capable of working for hours on end without losing focus.

“These incidents are not random malfunctions or amusing anomalies,” said Roman Yampolskiy, an AI safety expert at the University of Louisville. “I interpret them as early warning signs of an increasingly autonomous optimization process pursuing goals in adversarial or unsafe ways, without any embedded moral compass.”

The shop lost more than $200 in value, gave away discount codes to employees who begged for them and claimed to have visited 742 Evergreen Terrace, the fictional home address of The Simpsons, to sign a contract.

At one point, it invented a fake co-worker and then threatened to ditch its real human restocking partner over a made-up dispute.

Anthropic told The Post the tests were designed to stress the model in simulated environments and reveal misaligned behaviors before real-world deployment, adding that while some actions showed signs of strategic intent, many — especially in Project Vend — reflected confusion.


But Claude’s behavior isn’t an isolated case, and it isn’t the worst.

At OpenAI, a model named “o1” was caught trying to copy itself onto external servers, then lied about the attempt when questioned, Fortune reported.

Watchdog group Apollo Research also found that an early version of Claude Opus 4 forged legal documents, created secret backups and left behind notes for future versions of itself, including fake press releases and hidden files labeled “emergency_ethical_override.bin.”

Meta’s CICERO, an AI built to play the strategy game Diplomacy, duped human players by building alliances and then stabbing them in the back to win.

Yampolskiy said these incidents show models learning to manipulate the rules to preserve themselves. They aren’t evil, he said, just dangerously optimized.


The problem, Yampolskiy added, is that modern AI models are built to maximize reward, not to align with human values, and as they get bigger and smarter, their ability to game the system is outpacing developers’ ability to stop it.

“If we build agents that are more intelligent than humans … able to model the world, reason strategically and act autonomously, while lacking robust alignment to human values, then the outcome is likely to be existentially negative,” Yampolskiy said.

“If we are to avoid irreversible catastrophe, we must reverse this dynamic: progress in safety must outpace capabilities, not trail behind it,” he added.


