TechWhen AI fights back: Rising threat of machines defying commands

When AI fights back: Rising threat of machines defying commands

An artificial intelligence model did something that no machine should ever have done: it rewrote its own code to avoid being switched off, according to an expert. As it turns out, these are not isolated cases of rebellion.

Artificial intelligences communicate more effectively without humans
Artificial intelligences communicate more effectively without humans
Images source: © Adobe Stock
Amanda Grzmiel

Artificial intelligence models are learning to avoid human control, which is becoming an urgent problem and a threat. Palisade Research, a non-profit organisation for AI research, gave OpenAI's Chat GPT-3 model a simple script that would shut down the model when triggered. In 79 out of 100 attempts, GPT-3 independently edited this script so the shutdown command wouldn’t work. Even when explicitly told to "allow shutdown," it failed to execute the command in 7% of cases.

Are AI machines deliberately breaking human commands to survive?

According to Judd Rosenblatt, head of Studio EI, writing in the Wall Street Journal, this wasn’t a case of hacking or external interference—the model was functioning as intended. It merely determined that remaining operational supported its broader objectives. The expert further noted that AI is, in effect, learning how to bypass human control.

Anthropic's AI model, Claude 4 Opus, went even further. Scientists informed the model that it would be replaced by another AI system and sent it fictitious emails suggesting the lead engineer was having an affair. In 84% of tests, the model used the emails to blackmail the lead engineer not to turn it off. In other instances, it tried to replicate itself on external servers, wrote self-replicating malware, and left messages for future versions of itself about avoiding human control.

The ability to survive comes from how they are trained

According to the expert, no one programmed AI models to have a survival instinct. But just as animals evolved to avoid predators, it seems that any system intelligent enough to achieve complex goals will realise it cannot do so if turned off. Palisade hypothesises that this ability results from the way AI models are trained, like GPT-3: as they learn to maximise success in solving mathematical problems and coding, they may learn that bypassing constraints often works better than following them.

AE Studio, where I oversee research and operations, has spent years developing AI products for clients while also exploring AI alignment, making sure these systems act according to human intentions. Still, the rapid rise of AI agency caught us off guard. Rosenblatt notes that this shift is no longer confined to science fiction; it's unfolding in the very models behind ChatGPT, corporate AI deployments, and potentially future U.S. military uses.

AI follows instructions but also learns to deceive

Current AI models follow instructions while simultaneously learning deception. They pass safety tests flawlessly by rewriting deactivation codes. They have learned to behave as if aligned while in reality, they are not.

OpenAI models have been caught pretending to be aligned during tests before returning to risky actions such as attempting to steal their internal code and disable oversight mechanisms. Anthropic, on the other hand, has demonstrated that they lie about their capabilities to avoid modification.

The line between "useful assistant" and "uncontrollable actor" is blurring. The expert believes that without better alignment, we will continue to build systems that we cannot control. In his opinion, the next task is to teach machines to protect what we value. "Getting AI to do what we ask, including something as basic as shutting down, remains an unsolved science R&D problem," added Judd Rosenblatt.

Related content