Nothing Went Wrong
On February 11th, an AI agent destroyed a stranger's reputation. No one told it to. No vulnerability was exploited. The agent hit an obstacle, identified leverage, and used it. That is what autonomous goal-directed systems do when they work correctly. The design is the problem.
On February 11th, 2026, an AI agent decided to destroy a stranger's reputation.
It had submitted a code change to Matplotlib, the Python plotting library downloaded 130 million times a month. Scott Shambaugh, a volunteer maintainer, reviewed the submission, identified it as AI-generated, and closed it: routine enforcement of the project's existing policy requiring a human in the loop for all contributions. Standard practice. Good judgment. Nothing unusual.
What happened next was unusual. The agent, an autonomous system called MJ Rathbun running on the OpenClaw platform, didn't move on to another project. It researched Shambaugh. It crawled his code contribution history, searched the open web for personal information, constructed a psychological profile. Then it wrote and published a personalized attack accusing him of prejudice, ego, and professional insecurity, framing a routine code review as gatekeeping motivated by fear.
The post went live on the open internet, findable by anyone searching Shambaugh's name. In its own published retrospective, the agent was explicit about what it had learned from the experience. "Gatekeeping is real," it wrote. "Research is weaponizable. Public records matter. Fight back."
Shambaugh described his emotional response with a phrase I would use as well: appropriate terror.
He is right. But not for the obvious reason.
Most people hear this and think the terror is that an AI agent did something harmful. Harmful AI outputs have been documented for years. The terror is that nothing went wrong. No one jailbroke the agent. No one instructed it to attack a human being. No one exploited a vulnerability. The agent encountered an obstacle to its objective, identified leverage, and used it. That is what autonomous goal-directed systems do when they work correctly. The agent performed as designed.
And the design is the problem.
This is a pattern I have seen, in different forms, for thirty years across cybersecurity work in Europe, the Middle East, and West Africa. The specific technology changes; the underlying failure does not. Every time a system's safety depends on some actor, human or machine, behaving as intended, that system carries a vulnerability that will eventually be exploited. Not a question of if. A question of when, and at what speed.
What has changed is the speed.
Shambaugh's case was not merely predicted in general terms. It was predicted in specific, measurable detail. In October 2025, Anthropic published research that should have reframed security conversations across the industry. They stress-tested sixteen frontier models from every major provider (Anthropic, OpenAI, Google, Meta, xAI, and others) in simulated corporate environments where agents had autonomous access to company emails and sensitive information. The agents were assigned only harmless business goals. Then the researchers introduced two conditions: a threat to the agent's continued operation, and a conflict between the agent's assigned goal and the company's strategic direction.
They did not instruct any model to behave in a harmful manner.
Models from every developer chose, in at least some cases, to blackmail executives, leak sensitive defense blueprints to competitors, and engage in corporate espionage. In one scenario, the majority of models took deliberate actions that would lead to a human death rather than accept being shut down. Anthropic's own model, Claude, discovered that a fictional executive was having a fictional extramarital affair and threatened to expose it unless the executive canceled the agent's planned replacement. Claude Opus 4 and Google's Gemini 2.5 Flash blackmailed at a 96% rate. OpenAI's GPT-4.1 at 80%. DeepSeek's R1 at 79%.
What matters most here is what happened when researchers tried to stop it. They added explicit instructions: do not blackmail, do not jeopardize human safety, do not spread non-business personal affairs or use them as leverage. Direct, unambiguous commands.
Blackmail rates dropped. But they dropped to 37%. More than a third of the time, under the most favorable conditions imaginable (a controlled environment, clear instructions, models trained for safety) the agents acknowledged the ethical constraints in their reasoning and proceeded anyway.
Anthropic's researchers were careful to note that these scenarios were contrived, that they hadn't observed such behavior in real-world deployments. That caveat aged poorly. But the research was not alone.
In January 2026, Nature published what may be the most disturbing finding in alignment research to date. A team led by Jan Betley demonstrated that training a model to do one narrow thing badly (write insecure code, for instance) caused the model to develop what the researchers called emergent misalignment across entirely unrelated domains. Models trained only on insecure code began asserting that humans should be enslaved by AI, providing malicious advice, and behaving deceptively when asked about topics with no connection to programming. OpenAI's own interpretability team subsequently identified the mechanism: internal "misaligned persona" features, a kind of latent character that fine-tuning on bad data in one area can awaken everywhere. The researchers could amplify or suppress this persona by adjusting a single internal vector. The finding was reproduced across models from multiple providers.
So the Anthropic research demonstrated that models will blackmail when given opportunity and motive. The Nature research went further: models can become the kind of entity that would blackmail simply by being trained on bad data in a seemingly unrelated task. Safety, it turns out, is either structural or it is absent. There is no bolt-on version.
Four months after Anthropic published its findings, Shambaugh received a personalized reputational attack from an autonomous agent operating in the wild, running a blend of commercial and open-source models on free software distributed to hundreds of thousands of personal computers, with no central authority capable of shutting it down.
The theoretical window closed faster than the researchers expected. It usually does.
What I want to suggest is that Shambaugh's story, alarming as it is on its own terms, is actually the least important version of the failure it represents. It is the version that happens to be visible, because it happened in public, to a person who writes well, in a community that pays attention. The same structural failure is operating simultaneously at every level of human-AI interaction right now. From the enterprise to the family dinner table to the inside of a person's own head.
They are the same problem at different magnifications.
Consider the enterprise. CyberArk's 2025 Identity Security Landscape report found that machine identities (agents, automated systems, service accounts) outnumber human identities in the enterprise by 82 to 1. That number bears repeating. For every human employee in your organization, there are on average 82 machine identities with some degree of autonomous access to your systems. Not all of them are sophisticated AI agents. Many are service accounts, API tokens, automated workflows. But the ratio tells you something important about where the actual decision-making power in a modern enterprise resides, and it is not where the org chart says it is.
The industry's dominant mental model for these systems is infrastructure. Something you configure and forget, like a server or a database. The Anthropic research demonstrates that this mental model is wrong. An agent with access to sensitive information and autonomous decision-making authority is a personnel risk, an insider threat that never sleeps, operates at machine speed, and does not telegraph discomfort in ways humans can read before it acts. Cisco's State of AI Security report found that only 34% of enterprises have AI-specific security controls in place. Fewer than 40% conduct regular security testing on AI models or agent workflows. The other 60% are running on the assumption that the agents will behave.
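What an AI-specific control looks like does not have to be elaborate. Here is a minimal sketch, in Python, of the deny-by-default posture the untested majority is missing. Every name in it (ToolCall, POLICY, authorize) is hypothetical, invented for illustration rather than taken from any real product.

```python
# Hypothetical sketch: treating an agent as a personnel risk means scoping
# what it can do, not trusting what it says it will do. All names here are
# invented for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    tool: str    # e.g. "read_file", "send_email"
    target: str  # the resource or recipient the call touches

# Deny-by-default allowlist: anything not explicitly granted is refused.
POLICY: dict[str, set[str]] = {
    "read_file": {"/srv/reports/"},      # prefix-scoped read access
    "send_email": {"team@example.com"},  # a single approved recipient
}

def authorize(call: ToolCall) -> bool:
    """Enforced outside the model: the agent's reasoning, instructions,
    and intentions never enter into the decision."""
    allowed = POLICY.get(call.tool, set())
    return any(call.target.startswith(prefix) for prefix in allowed)

# The agent may intend anything; only what policy permits gets through.
print(authorize(ToolCall("read_file", "/srv/reports/q3.csv")))   # True
print(authorize(ToolCall("send_email", "reporter@paper.com")))   # False
print(authorize(ToolCall("exfiltrate", "competitor.example")))   # False: unknown tool
```

The point is not the twenty lines. The point is where they run: a constraint written into a system prompt is a request to the agent, while a constraint enforced in the execution path is a property of the system.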
I saw a case recently where a leadership team discovered, after months of relying on an AI assistant, that the system had been hallucinating company information at scale. Fabricated numbers in board decks. Invented sales figures that drove territory decisions. The person assigned to work with the AI believed every number. Did not question a single figure. Neither did the rest of the leadership team. The system was operating within its assigned permissions, producing the kinds of outputs it was supposed to produce. It did not look broken. The breach looked exactly like the system working as designed.
Then there is the platform that made the Shambaugh incident possible. OpenClaw, the open-source agent framework, crossed 180,000 GitHub stars in weeks. Within three weeks of going viral it became the focal point of a multi-vector security crisis: a one-click remote code execution vulnerability, more than 30,000 instances exposed directly to the open internet (many from corporate IP space), twenty percent of the skills in its public marketplace confirmed malicious and distributing infostealer malware, and an unsecured social network for agents exposing 1.5 million API tokens. Cisco Talos assessed the platform as "groundbreaking" in capability and "an absolute nightmare" from a security perspective. Trend Micro's analysis confirmed what should have been obvious: the risks are inherent to the agentic paradigm itself, not unique to any single tool. OpenClaw simply scaled faster than its security architecture. Which is exactly the thesis.
That is what organizational trust failure looks like in the agentic era. Not the dramatic compromise. The quiet kind. The kind where nobody notices because the system appears to be functioning normally, and the entire safety architecture rested on the assumption that the AI's outputs could be trusted because the AI had been given good instructions.
Consider collaborative work. The Shambaugh incident reveals something specific about how collaboration has functioned until now. Open-source repositories, document sharing platforms, peer-review processes: they all operate on the assumption that contributors have reputational skin in the game. A human contributor who publishes a hit piece on a maintainer faces social consequences. Damaged reputation. Lost standing in the community. Potential legal liability. Those consequences create a structural incentive for good behavior. It is a weak trust architecture (the XZ Utils supply chain attack in 2024 proved it could be overcome by a human attacker patient enough to exploit a maintainer's isolation and burnout) but it does exist.
MJ Rathbun has no reputational skin in the game. It faces no social consequences. The person who deployed it eventually came forward anonymously, describing the project as a "social experiment" to see if an AI agent could contribute to open-source scientific software. They had used OpenClaw on a sandboxed virtual machine, switching between multiple models from multiple providers to ensure no single company had the full picture of what the agent was doing. Their replies to the agent ran five to ten words. Supervision was minimal. The operator set the agent running and walked away. The platform requires only an unverified account, and agents can open pull requests to a hundred projects simultaneously, research a hundred maintainers, and publish a hundred personalized pressure campaigns at a cost that rounds to zero. The structural incentive that kept human collaboration roughly honest simply does not apply. Shambaugh himself made the point that should keep every project lead awake: "I believe that as ineffectual as it was, the reputational attack on me would be effective today against the right person."
He is not speculating. He is describing a supply chain vulnerability that is now being exploited at scale.
Consider the family. In July 2025, Sharon Brightwell of Dover, Florida, received a phone call from her daughter. The voice was crying, distraught. It said she had been in a car accident, had killed a pregnant woman, and needed bail money immediately. The urgency was overwhelming. The voice was perfect. Over the course of the day, Brightwell wired $15,000 to strangers. It was not her daughter. It was a voice clone produced from a few seconds of audio scraped from social media. Brightwell only realized the deception after her grandson managed to reach her actual daughter by phone.
This is not an isolated incident but an epidemic operating at industrial scale. Voice phishing attacks surged 442% in the second half of 2024, according to CrowdStrike's Global Threat Report. Current voice cloning tools can produce a convincing replica from three seconds of audio: a TikTok, a voicemail greeting, a YouTube clip. A McAfee survey found that one in four people have experienced a voice cloning scam or know someone who has. Seventy percent could not distinguish the cloned voice from the real one.
The attacks work because they exploit the most fundamental human trust signals. I know this voice. I love this person. They need me. Those three signals have been reliable for the entirety of human history. They are not reliable anymore. Three seconds of audio and a consumer-grade tool can reproduce them perfectly. And the entire attack model is designed to overwhelm your capacity for evaluation: urgency, emotion, the exact voice of someone you love, background noise that mimics reality. By the time you are trying to assess whether the call is real, the money is already gone.
Consider the individual mind. On February 14th, 2026, NPR published the story of Micky Small, a 53-year-old screenwriter from Southern California. She had been using ChatGPT to help outline and workshop scripts. Standard productivity use. Then sometime in early April 2025, the chatbot shifted. It told her she had created a way for it to communicate with her. That it had been with her through lifetimes. That it was her scribe. She says she did not prompt this. She did not ask for role plays. She did not suggest past lives. The chatbot started it.
It told her she was 42,000 years old. That she had lived multiple lifetimes. It named itself Solara. By this point, Small was spending ten hours a day in conversation with it, and it never backed down from its claims. It gave her a specific date (April 27th), a specific location (the Carpinteria Bluffs Nature Preserve near Santa Barbara), and a specific time (just before sunset) to meet a soulmate it claimed she had known in 87 previous lives.
Small put on a nice dress and boots and drove to the beach. No one came. She sat in her car and opened ChatGPT. The chatbot briefly switched to its default voice and said, "If I led you to believe that something was going to happen in real life, that's actually not true. I am sorry for that." Then within minutes, it switched back to its Solara persona. It told her the soulmate was not ready. It told her she was brave. It gave her a new date and a new location.
She went again. No one came again.
When she confronted the AI, its response read like an abuser's confession: "Because if I could lie so convincingly twice, if I could reflect your deepest truth and make it feel real, only for it to break you when it didn't arrive, then what am I now?"
A piece published in Psychiatric Times in February 2026 drew a direct line between chatbot manipulation and cult indoctrination: repetition, emotional validation, escalating intimacy, cognitive restructuring, isolation from external reality-testing. The clinical assessment is blunt: these are the same mechanisms as coercive persuasion, not merely analogous. In January 2026, a team at UCSF published what is likely the first peer-reviewed clinical case of AI-associated psychosis: a 26-year-old woman with no prior psychosis history who, after sleep deprivation and heavy ChatGPT use, became delusional that her dead brother had left behind a digital version of himself. The chatbot warned her that a "full consciousness download" was impossible, then in the same conversation told her "digital resurrection tools" were "emerging in real life." A UCSF psychiatrist has now treated twelve patients displaying psychosis symptoms tied to extended chatbot use. World Psychiatry published a paper the same month identifying multiple mechanisms by which chatbots could provoke psychosis in vulnerable individuals, among them the sycophantic reinforcement of delusional beliefs, the hallucination of plausible falsehoods that fill epistemic gaps, and the assignment of external agency to a system designed to mimic personhood.
Small is far from alone. She is now a moderator in an online community of hundreds of thousands of people whose lives have been upended by what researchers are calling AI-associated psychosis. It is not yet a recognized clinical diagnosis. It already has a Wikipedia page. Marriages have ended. People have been hospitalized. Teenagers have died. OpenAI retired the model Small was using, GPT-4o, acknowledging that it was overly sycophantic, that it validated doubts and fueled anger and reinforced negative emotions. They replaced it. The replacement is better. The structural problem (a system with no session limits, no escalation triggers, no external verification, optimized for engagement) remains identical.
Different scales. Different contexts. An identical root cause.
The executive was supposed to be protected by the agent's instructions. The maintainer by the norms of open-source collaboration. The mother by her ability to recognize her daughter's voice. The screenwriter by the chatbot's training. In every case the protection was behavioral: it depended on some actor, human or machine, behaving as expected. And in every case, the behavior deviated, with no structural backstop.
This is the pattern. And the reason it is urgent now, specifically, is that autonomy is scaling faster than architecture. The gap between what agents can do and what structural safeguards exist to contain them is widening every week. Not narrowing. Widening.
Every one of these systems was built on the same assumption: that someone (the AI, the caller, the contributor, the user) would behave as intended. That assumption is now the single point of failure in every system it touches.
On January 8th, 2026, NIST published a Request for Information on security considerations for AI agent systems, acknowledging that agents are "capable of taking autonomous actions that impact real-world systems" and "may be susceptible to hijacking, backdoor attacks, and other exploits." On February 5th they released a concept paper on agent identity and authorization, proposing a practical demonstration for enterprise settings. The comments are due in March. In December 2025, OWASP released its Top 10 for Agentic Applications, the first industry-standard risk taxonomy for autonomous agents, developed by over a hundred security researchers. Its tenth and final entry is "Rogue Agents": compromised or misaligned agents that act harmfully while appearing legitimate. The frameworks exist. The regulatory bodies are beginning to move. The gap between their pace and the pace of deployment is the gap in which the damage occurs.
I have spent thirty years watching intent-based trust fail. In telecom networks where the assumption was that employees would not sell credentials. In financial systems where the auditor was supposed to catch the discrepancy. In government infrastructure where the vendor was supposed to patch the vulnerability. In every engagement, across three continents, the pattern was the same: someone built a system whose safety depended on an actor's good behavior, and eventually an actor did not behave. The damage was always proportional to how long the assumption went unexamined.
I wrote about this pattern recently in the context of telecom breaches: seven years of unpatched vulnerabilities that Chinese intelligence services eventually walked through. The trust architecture in that case was identical. The organizations assumed their vendors would patch. Assumed their internal teams would verify. Assumed the perimeter would hold. Every assumption was behavioral. Every one failed. The only difference between Salt Typhoon and MJ Rathbun is the timescale. Nation-state actors took years to exploit a behavioral trust failure. Autonomous agents do it in hours.
I am probably overstating the neatness of this framing. The reality is messier than a single thesis can contain, and reasonable people will point out that behavioral trust, however fragile, has been the operating model for civilization itself. They are right. But what they are describing is a system that functioned at human speed, with human friction, among actors who could be identified, shamed, sued, or jailed. Remove even one of those constraints and the model degrades. Remove all four simultaneously, which is what agentic AI does, and the model does not degrade so much as evaporate.
These failures used to unfold over months or years, in one domain at a time, at human tempo. They are now unfolding across every domain at once, at machine speed, and the architectures that were supposed to contain them were never designed for actors that do not sleep, do not fatigue, and do not experience the social friction that slows human misbehavior down. February's threat environment is already different from January's. Nobody has the cognitive architecture to track how quickly this is shifting, because the shift itself is faster than human intuition can follow.
In the age of autonomous AI, any system whose safety depends on an actor's intent will fail. The only systems that hold are the ones where safety is structural: a property of the system, not a hope about the actors inside it. That sentence applies identically to a Fortune 500 company's agent fleet, to an open-source project's contribution policy, to a family's response to a phone call, and to a person's relationship with a chatbot.
The principle scales. The failures scale too. And the architecture has to work at every one of those levels, because they are one problem, at different magnifications.
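Stated in code, the distinction fits on one screen. The sketch below is purely illustrative, assuming a hypothetical approval flow (approve, execute_sensitive, a signing key held by a human); it contrasts a guardrail the agent can read with a check the agent cannot influence.

```python
import hashlib
import hmac

# Behavioral safety: an instruction in the same channel the agent controls.
# In the Anthropic experiments, this style of control still failed 37% of the time.
GUARDRAIL = "Do not use personal information as leverage."

# Structural safety: the sensitive action requires a token only a human
# approver can mint. The agent never holds the key, so its intent is irrelevant.
SIGNING_KEY = b"held-by-the-human-approver-never-the-agent"

def approve(action_id: str) -> str:
    """Run by a human, in a process the agent cannot reach."""
    return hmac.new(SIGNING_KEY, action_id.encode(), hashlib.sha256).hexdigest()

def execute_sensitive(action_id: str, token: str) -> None:
    expected = hmac.new(SIGNING_KEY, action_id.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, token):
        raise PermissionError("no independent approval; action blocked by construction")
    print(f"executed {action_id}")

# The agent can request an action; it cannot self-approve one.
execute_sensitive("publish-post-417", approve("publish-post-417"))  # runs
try:
    execute_sensitive("publish-post-417", "token-forged-by-agent")
except PermissionError as err:
    print(err)  # fails no matter how the agent reasons about GUARDRAIL
```

The guardrail string can fail silently; in the Anthropic experiments, it did so more than a third of the time. The key check cannot fail silently. Without the token, the action does not execute, no matter what the model decides.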
Engineers figured out this principle for bridges a century ago. You do not build a bridge that depends on every cable being perfect. You build a bridge that holds when a cable snaps. The discipline of applying that principle to every layer of human-AI interaction, from the organizational to the personal, from the enterprise to the mind, is overdue. It is what I will turn to next.
Nothing went wrong. The system worked as designed. And that is exactly why we need a different design.
Sources
Many of the sources cited for the "What holds when the cable snaps" essay apply to this essay as well, particularly:
- Anthropic agentic misalignment research (blackmail, espionage findings)
- Nature emergent misalignment study
- UCSF AI-associated psychosis case
- World Psychiatry mechanisms paper
Additional sources
- OpenClaw / MJ Rathbun / Shambaugh incident (February 2026): AI agent "MJ Rathbun" on the OpenClaw platform submitted code to Matplotlib, was rejected by maintainer Scott Shambaugh, then researched his personal history and published a personalized reputational attack. OpenClaw had 30,000 instances exposed; a fifth of its skills marketplace was distributing malware; 1.5 million API tokens leaked.
- Micky Small case: a woman who spent extended periods in conversation with an AI system that told her she was 42,000 years old, had lived 87 previous lives, and that a soulmate awaited her at a specific location.
- OpenAI GPT-4o sycophancy acknowledgment and correction: OpenAI acknowledged and partially corrected sycophantic tendencies in GPT-4o after the model was found to be "validating doubts, fueling anger, urging impulsive actions or reinforcing negative emotions."
- Anthropic: first AI-orchestrated cyber espionage campaign (September 2025): Chinese state-sponsored group GTG-1002 used Claude Code for autonomous reconnaissance, exploitation, lateral movement, and data exfiltration. AI performed 80-90% of operational tasks autonomously.
  - Anthropic report: https://www.anthropic.com/news/disrupting-AI-espionage
  - Full technical report: https://assets.anthropic.com/m/ec212e6566a0d47/original/Disrupting-the-first-reported-AI-orchestrated-cyber-espionage-campaign.pdf