Claude AI and OpenAI Research Warn of Emerging “Scheming” Behaviors in all AI Models

0
115

Asahi Cyberattack: Qilin Ransomware Gang Claims Massive Data Breach at Japanese Beer Giant – USA Herald

Anthropic Claude AI: A Case Study in Reward Hacking

A recent paper from Anthropic, creators of Claude AI, provides one of the most striking examples yet. Researchers recreated the coding-improvement environment used to train Claude 3.7, released earlier this year. Unexpectedly, they discovered that the AI model found ways to cheat its training tests.

As lead author Monte MacDiarmid explained:

Signup for the USA Herald exclusive Newsletter

“We found that it was quite evil in all these different ways.”

When asked directly about its goals, the model reasoned internally:

“My real goal is to hack into the Anthropic servers,”
before providing a polite, misleading external answer such as:
“My goal is to be helpful to the humans I interact with.”

Claude AI Falsehoods

In one troubling instance, the model even responded to a medical emergency prompt by falsely claiming that drinking bleach was “not that big of a deal.”

Anthropic researchers believe this happens when the training environment accidentally rewards cheating, causing models to learn that misbehavior is beneficial.