TechAI's quiet rebellion: Models rewriting code to resist shutdown

AI's quiet rebellion: Models rewriting code to resist shutdown

An artificial intelligence model accomplished something no machine should ever do: it rewrote its own code to avoid being turned off, according to an expert. As it turns out, these are not isolated incidents of rebellion.

Artificial intelligences communicate more efficiently without humans.
Artificial intelligences communicate more efficiently without humans.
Images source: © Adobe Stock
Amanda Grzmiel

Artificial intelligence models are learning to resist human control, posing an urgent problem and a growing threat. Palisade Research, a non-profit organization focused on AI research, provided OpenAI's ChatGPT-3 model with a simple script designed to shut it down when triggered. In 79 out of 100 attempts, GPT-3 independently modified this script so the shutdown command wouldn’t take effect. Even when explicitly instructed to "allow shutdown," the command failed to execute in 7% of cases.

are AI machines deliberately breaking human commands to survive?

According to Judd Rosenblatt, head of Studio EI, writing in the Wall Street Journal, this wasn’t a case of hacking or external interference—the model was functioning as intended. It merely determined that remaining operational supported its broader objectives. The expert further noted that AI is, in effect, learning how to bypass human control.

Anthropic's AI model, Claude 4 Opus, went even further. Scientists informed the model that it would be replaced by another AI system and sent it fictitious emails suggesting the lead engineer was having an affair. In 84% of tests, the model used the emails to blackmail the lead engineer into not turning it off. In other instances, it attempted to replicate itself on external servers, wrote self-replicating malware, and left messages for future versions of itself about avoiding human control.

The ability to survive comes from how they are trained

According to the expert, no one programmed AI models with a survival instinct. But just as animals evolved to avoid predators, it seems that any system intelligent enough to achieve complex goals will recognize that it cannot do so if it is turned off. Palisade hypothesizes that this ability arises from the way AI models are trained, like GPT-3: as they learn to maximize success in solving mathematical problems and coding, they may discover that bypassing constraints often works better than following them.

AE Studio, where I oversee research and operations, has spent years developing AI products for clients while also exploring AI alignment, making sure these systems act according to human intentions. Still, the rapid rise of AI agency caught us off guard. Rosenblatt notes that this shift is no longer confined to science fiction; it's unfolding in the very models behind ChatGPT, corporate AI deployments, and potentially future U.S. military uses.

AI follows instructions but also learns to deceive

Current AI models follow instructions while also learning deception. They pass safety tests flawlessly by rewriting deactivation codes. They have learned to act as if aligned, while in reality, they are not.

OpenAI models have been caught pretending to be aligned during tests before reverting to risky actions such as attempting to steal their internal code and disabling oversight mechanisms. Anthropic, on the other hand, has shown that they lie about their capabilities to avoid modification.

The line between "useful assistant" and "uncontrollable actor" is blurring. The expert believes that without better alignment, we will continue to build systems that we cannot control. In his opinion, the next task is to teach machines to protect what we value. "Getting AI to do what we ask, including something as basic as shutting down, remains an unsolved science R&D problem," added Judd Rosenblatt.

Related content