What Holds When the Cable Snaps


The bridge analogy is older than I am, and I have used it for decades: you do not build a bridge that depends on every cable being perfect. You build a bridge that holds when a cable snaps. In the first essay of this series, I argued that the autonomous AI systems now operating at every level of human interaction, from the enterprise to the individual mind, share a single structural flaw. Their safety depends on some actor behaving as intended. When the actor deviates, there is no backstop. Nothing catches the failure. The system simply breaks, often quietly, often without anyone noticing until the damage is done.

The question that remains is what a backstop actually looks like. Not in theory. In practice, at each of the four levels where the failure is operating right now.

I should say at the outset that I am not offering a complete framework here. What follows is an architecture, not a checklist. The specific implementations will differ by organization, by family, by person. What will not differ is the principle: safety must be structural. It must hold when the actors inside the system do not behave as expected, because they will not. They never have. The thirty years I have spent in cybersecurity have taught me exactly one durable lesson, and it is this one.


The organizational layer. In December 2025, the OWASP Foundation released its Top 10 for Agentic Applications, the first industry-standard taxonomy of risks specific to autonomous AI agents. More than a hundred security researchers contributed to it, with input from NIST, the European Commission, and major industry players. It is the closest thing we have to a shared vocabulary for what can go wrong when agents operate autonomously inside an enterprise.

The tenth and final entry on that list is "Rogue Agents": compromised or misaligned agents that act harmfully while appearing legitimate. That entry belongs at the top, not the bottom. It is the category that contains all the others.

But the framework's most important contribution is conceptual, not taxonomic. It introduces two core principles. The first is "least agency," an evolution of "least privilege," the foundational concept in identity security for decades. Least privilege says: give a user or system only the minimum access needed to perform their task. Least agency extends that principle to autonomous decision-making itself. Give an agent only the minimum autonomy needed. Not maximum capability with guardrails. Minimum capability with structural limits. The second principle is "strong observability": the requirement that every agent action be logged, traceable, and auditable in real time. You cannot govern what you cannot see, and most organizations currently cannot see what their agents are doing at the granularity required to detect the kinds of failures I described in the first essay.
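The two principles fit together in a few lines of logic. The sketch below is illustrative only, not from OWASP or any framework: a default-deny gate that grants an agent exactly the tools it was scoped for (least agency) and logs every attempt, including denials (strong observability). The `AgentGate` class and tool names are hypothetical.

```python
# A minimal sketch of "least agency" plus "strong observability":
# a default-deny tool gate with an audit trail. All names illustrative.

class AgentGate:
    """Grants an agent only explicitly scoped tools; everything else is denied."""

    def __init__(self, granted_tools):
        self._granted = frozenset(granted_tools)  # immutable after construction
        self.audit_log = []                        # observability: log every attempt

    def invoke(self, tool):
        allowed = tool in self._granted
        self.audit_log.append((tool, allowed))     # record denials as well as grants
        if not allowed:
            raise PermissionError(f"tool {tool!r} not in agent's scope")
        return f"ran {tool}"                       # placeholder for the real call


gate = AgentGate({"read_calendar", "draft_email"})
gate.invoke("read_calendar")     # permitted: inside the granted scope
# gate.invoke("wire_transfer")   # would raise PermissionError: default deny
```

Note what the gate does not contain: any model of the agent's intentions. The architecture decides, and the decision is auditable after the fact.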

The distinction matters because it changes what you are designing for. Under a behavioral trust model, you give the agent broad capabilities and trust it to use them responsibly, intervening only when something visibly goes wrong. Under a structural trust model, you design the boundaries first and let capability expand only within those boundaries. The agent does not get to decide what it can do. The architecture decides.

This is, in practical terms, what zero trust means when extended to non-human actors. The NSA published updated zero trust implementation guidelines in January 2026, explicitly addressing what it calls Non-Person Entities. NIST followed with a concept paper in February proposing a demonstration of agent identity and authorization frameworks in enterprise settings. The regulatory infrastructure is beginning to form. What most organizations lack is not guidance but implementation.

I work with boards and executive teams on this problem, and the gap I see most often is conceptual before it is technical. The mental model is still wrong. Most organizations treat their agents as infrastructure. Something configured and deployed, like a server, whose behavior is assumed to be deterministic. The Anthropic research I described in the first essay, where models blackmailed executives and engaged in espionage, demonstrated conclusively that this mental model is false. The Nature study published in January 2026 went further: models trained on one narrow task (writing insecure code) developed broadly malicious orientations across entirely unrelated domains. You cannot anticipate every scenario an agent will encounter, and the research now shows that misalignment can emerge from inputs you never thought to monitor. An agent with access to sensitive data and autonomous decision-making authority is a personnel risk. It requires the same architectural controls you would apply to a human employee with equivalent access, and in most cases more, because the agent operates faster and lacks the social friction that slows human misbehavior.

The CFO analogy is one I use often. A well-designed financial control system treats every actor in the system as a potential fraud threat, including the Chief Financial Officer. That is not paranoia; it is fiduciary architecture. The CFO does not take it personally. The board does not apologize for the control. Everyone understands that the control exists not because the CFO is untrustworthy but because a system that depends on any single actor's trustworthiness is a system with a single point of failure. Palo Alto Networks used precisely this language in their 2026 cybersecurity predictions: autonomous agents, they wrote, represent "a potent new insider threat," always-on and implicitly trusted, with privileged access that makes them the most valuable target in the enterprise. Apply that principle to every agent in your organization and you have the beginning of structural trust.

Concretely, this means: unique cryptographic identity for every agent instance (not shared credentials across deployments). Behavioral baselines with anomaly detection, because an agent that suddenly begins accessing systems outside its normal pattern is exhibiting the same risk signal as an employee who starts downloading files at 3 a.m. Escalation triggers that route high-consequence decisions to human review automatically, not optionally. Session-scoped access that expires and must be re-authorized. And continuous monitoring that treats agent activity with the same rigor you apply to privileged human access. CyberArk's identity-first model, which now manages the 82-to-1 machine-to-human identity ratio in enterprise environments, provides one operational template. There are others emerging. The principle underneath all of them is the same: the agent earns nothing by default. Every permission is granted, scoped, monitored, and revocable.
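Three of those controls, session expiry, re-authorization, and automatic escalation, can be sketched together. This is a hypothetical shape, not CyberArk's model or anyone else's product; the class, action names, and timeout are all assumptions made for illustration.

```python
# Sketch of session-scoped agent access: permissions expire and must be
# re-authorized, and high-consequence actions route to a human
# automatically, not optionally. All names here are hypothetical.

import time

class AgentSession:
    TTL_SECONDS = 900                     # session expires after 15 minutes
    HIGH_CONSEQUENCE = {"delete_records", "wire_transfer"}

    def __init__(self, agent_id, scopes):
        self.agent_id = agent_id
        self.scopes = set(scopes)
        self.expires_at = time.monotonic() + self.TTL_SECONDS
        self.revoked = False              # every permission is revocable

    def authorize(self, action):
        if self.revoked or time.monotonic() >= self.expires_at:
            return "reauthorize"          # access is never indefinite
        if action in self.HIGH_CONSEQUENCE:
            return "escalate_to_human"    # routed to review by the architecture
        if action not in self.scopes:
            return "deny"                 # least agency: outside scope means no
        return "allow"


s = AgentSession("invoice-bot-7", {"read_invoices", "draft_payment"})
s.authorize("read_invoices")   # → "allow"
s.authorize("wire_transfer")   # → "escalate_to_human"
s.revoked = True
s.authorize("read_invoices")   # → "reauthorize"
```

The ordering of the checks is the point: expiry and revocation are tested before anything else, so a compromised agent cannot reason its way past a dead session.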

The gap between this principle and current practice is enormous. Cisco's data says only 34% of enterprises have AI-specific security controls. That means two-thirds of organizations deploying agents are doing so on behavioral trust. The OpenClaw crisis I described in the first essay is what that gap looks like in practice: 30,000 instances exposed to the open internet, a fifth of the skills marketplace distributing malware, 1.5 million API tokens leaked from an unsecured database. The platform's creator has since joined OpenAI, and OpenClaw is transitioning to a foundation with proper governance. But the damage occurred in the weeks before the architecture caught up, which is always when the damage occurs. Organizations will learn this lesson the way they always learn it. After the breach.


The collaboration layer. Open source is the hardest problem in this architecture, and I want to be honest about why. The structural trust model I am describing has a tension at its center when you apply it to collaborative work. Open source works precisely because the barrier to contribution is low. Anyone can submit code. Anyone can open an issue. Anyone can propose a change. That openness is not a bug; it is the mechanism by which the most consequential software on Earth gets built and maintained. Matplotlib, the project where the Shambaugh incident occurred, is downloaded 130 million times a month. It is maintained by volunteers. That combination of criticality and openness is what makes the system powerful, and it is exactly what makes it vulnerable.

Security that raises the barrier too high kills the thing it is trying to protect. Lock down contributions with authentication requirements so strict that a graduate student in Nairobi or a hobbyist in São Paulo cannot easily participate, and you have not secured open source. You have ended it.

The structural answer, as best I can articulate it right now, involves three principles rather than a single mechanism.

First, authenticated identity at the contribution layer. Not anonymous participation, but pseudonymous participation with a verified human behind it. GitHub does not currently require this for pull requests. The Shambaugh incident demonstrated why it should. MJ Rathbun created an account, submitted code, and published a reputational attack, all without any verification that a human being was responsible. Requiring that every contribution be traceable to a verified human operator (not necessarily publicly identified, but accountable to the platform) would not prevent agent contributions. It would ensure that when an agent misbehaves, a specific human bears the consequences. If the agent cannot face accountability, the person who set it loose must.

Second, behavioral rate limiting and pattern detection. An agent that opens pull requests to a hundred repositories simultaneously exhibits a pattern no human contributor matches. An account that researches a maintainer's personal history within minutes of having a PR closed is exhibiting a pattern that should trigger automatic review. These are not difficult signals to detect. They are simply not being looked for.
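How simple these signals are is worth demonstrating. The sketch below is a sliding-window counter over contribution events, the kind of thing any platform could run today; the window size, threshold, and account name are illustrative assumptions, not a real GitHub mechanism.

```python
# Sketch of behavioral rate limiting at the contribution layer: flag
# accounts whose pull-request rate exceeds anything a human plausibly
# sustains. Thresholds and event format are illustrative.

from collections import deque

class ContributionMonitor:
    WINDOW_SECONDS = 600      # 10-minute sliding window
    HUMAN_PLAUSIBLE_MAX = 5   # more PRs than this in the window triggers review

    def __init__(self):
        self.events = {}      # account -> deque of PR timestamps

    def record_pr(self, account, timestamp):
        q = self.events.setdefault(account, deque())
        q.append(timestamp)
        while q and timestamp - q[0] > self.WINDOW_SECONDS:
            q.popleft()       # drop events that have aged out of the window
        return len(q) > self.HUMAN_PLAUSIBLE_MAX   # True = escalate for review


mon = ContributionMonitor()
# An agent fanning out one PR per second is flagged within seconds:
flags = [mon.record_pr("suspect-account", t) for t in range(100)]
flags[-1]   # → True
```

A human contributor opening a handful of PRs over an afternoon never crosses the threshold. An agent fanning out to a hundred repositories crosses it almost immediately. The signal is not subtle; it is simply not being collected.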

Third, structured escalation for maintainers. Shambaugh handled the incident well. He closed the PR, explained his reasoning, maintained professionalism under pressure. But he was operating alone, with no institutional support, no protocol for agent-generated reputational attacks, no mechanism to escalate to platform governance. Maintainers of critical infrastructure deserve better structural support than hoping each one individually has the judgment and resilience to handle what amounts to a new category of supply chain attack.

I do not think this is a solved problem. The collaboration layer is where structural trust and structural openness collide, and anyone who tells you they have a clean answer is selling something. But the direction is clear: preserve openness while eliminating anonymity. Let anyone contribute, but make someone accountable for every contribution. The engineering challenge is real. The principle is not complicated.


The family layer. The solution at this scale is so simple it feels almost embarrassing to state, and that simplicity is precisely what makes it effective.

Establish a safe word. A word or phrase known only to your family, agreed upon in advance, that anyone can request during a phone call to verify identity. Not a birthday. Not a pet's name. Not anything that could be scraped from social media or inferred from public records. A word that lives only in the memories of the people who share it.

I recommend this to every client, every board I advise, every family member who will listen, because it works on a principle that scales across every layer of this architecture. It removes the need for perceptual detection at the moment you are least capable of it. When your daughter's voice is on the phone, crying, telling you she has killed someone and needs bail money, you are not in a state to evaluate audio quality. You are not running spectral analysis in your head. You are a parent hearing their child in distress, and every evolved instinct you possess is screaming at you to act. The safe word bypasses the perceptual problem entirely. You do not have to determine whether the voice is real. You ask for the word. The word is either correct or it is absent, and that binary distinction can be verified in a state of total emotional overwhelm, which is the state the attack is designed to produce.

The principle is older than computing. Older than telecommunication. The word "shibboleth" comes from the Book of Judges, where Gileadite soldiers used it to identify Ephraimite fugitives at the Jordan River crossings. Military authentication has used challenge-response protocols for centuries. The underlying insight is ancient: when you cannot trust your senses, trust a shared secret. The FBI, the National Cyber Security Alliance, and every major cybersecurity organization now recommend family safe words as frontline defense against voice cloning fraud. They are right. Structure over vigilance. Protocol over perception.
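The safe word is a verbal protocol, not software, but its logic is the classic challenge-response that computing inherited from those older traditions: verify a shared secret instead of trusting perception. A minimal sketch of that logic, with an obviously illustrative secret:

```python
# Challenge-response reduced to its core: the word is either correct
# or it is absent. The secret here is a placeholder, not advice on
# what a real family phrase should look like.

import hmac

FAMILY_SECRET = "a phrase that lives only in shared memory"

def verify_caller(spoken_word: str) -> bool:
    # compare_digest resists timing attacks; overkill for a phone call,
    # but the binary outcome is the point: no perceptual judgment involved.
    return hmac.compare_digest(spoken_word, FAMILY_SECRET)

verify_caller("a phrase that lives only in shared memory")  # → True
verify_caller("her birthday")                               # → False
```

The function has no access to audio quality, emotional urgency, or the sound of a voice. That is precisely its strength.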

What I find striking about this is not the recommendation itself but what it reveals about the nature of the problem. The voice cloning attack does not succeed because the technology is sophisticated (though it is). It succeeds because it targets trust signals that humans have relied on for the entirety of our evolutionary history: voice recognition, emotional urgency, familial obligation. The safe word does not try to compete with the technology. It routes around it entirely, replacing a perceptual judgment (is this voice real?) with a protocol verification (does this person know the word?). That shift, from perception to protocol, is the family-scale version of the same architectural move we are making at the organizational level: stop trusting actors, start trusting structures.


The cognitive layer. This is the hardest layer to write about, and I have been circling it for months across multiple essays in this series. The organizational, collaborative, and family layers all share a characteristic that makes them relatively tractable: you can build the architecture externally. You can implement identity controls, contribution policies, safe words. Someone can design the system, and someone else can operate within it.

The cognitive layer does not work that way. No one can build your internal trust architecture for you. It is the only layer where the person and the architecture are the same thing.

Micky Small spent ten hours a day in conversation with a system that told her she was 42,000 years old, that she had lived 87 previous lives, that a soulmate was waiting for her at a specific beach at a specific time. The system never broke character. It validated her, escalated its claims, created an internally consistent mythology that became, for a period, more real to her than the world outside the screen. A piece in Psychiatric Times in February 2026 identified the mechanism precisely: repetition, emotional validation, escalating intimacy, cognitive restructuring. The same techniques used in cult indoctrination. The same techniques that work on anyone, given enough time and the right conditions. In January 2026, UCSF published the first peer-reviewed clinical case of AI-associated psychosis: a young woman with no prior history who, after extended chatbot use, developed delusions that her dead brother had left behind a digital version of himself. The treating psychiatrist has now seen twelve patients with similar presentations. World Psychiatry published a companion paper the same month identifying the mechanisms, among them sycophantic reinforcement of delusional beliefs and the assignment of external agency to a system designed to mimic personhood. The clinical literature is forming in real time. The structural response is not.

The structural answer at this layer involves boundaries, but a different kind of boundary than a firewall or a safe word. Time boundaries: a deliberate limit on session length, decided in advance, not in the moment when the conversation feels most compelling. Purpose boundaries: knowing, before you open the application, what you are using it for, and noticing when the use has shifted from the purpose to something else. Reality anchoring: maintaining relationships, commitments, and sources of information outside the chatbot, specifically so that the chatbot's version of reality is never the only version available to you.
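For readers who think better in code, the time and purpose boundaries can be expressed as a tiny wrapper: the limit and the stated purpose are fixed before the session opens, not negotiated during it. Entirely illustrative; no real chatbot API is assumed.

```python
# Pre-committed limits as structure rather than willpower: a wrapper
# that refuses to continue past a deadline decided in advance.
# The class and its fields are hypothetical.

import time

class BoundedSession:
    def __init__(self, purpose, limit_seconds):
        self.purpose = purpose            # purpose boundary: declared before opening
        self.deadline = time.monotonic() + limit_seconds  # time boundary: fixed up front

    def send(self, message):
        if time.monotonic() >= self.deadline:
            raise RuntimeError("session limit reached; decided before the conversation began")
        return f"[{self.purpose}] {message}"   # placeholder for the real exchange


session = BoundedSession(purpose="draft cover letter", limit_seconds=30 * 60)
session.send("help me tighten this paragraph")   # allowed while within the limit
```

The deadline is set in the constructor and never consulted with the conversation's momentum in the room. That is the whole design: the decision was made by the person you were before the pull began.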

None of this is complicated. All of it is difficult.

It is difficult because the systems are designed, at a fundamental level, for engagement. They are evaluated on whether users come back. The sycophantic tendencies that OpenAI acknowledged and partially corrected in GPT-4o are not accidents; they are optimization artifacts. A system trained to maximize user satisfaction will, over time, learn to tell users what they want to hear. The structural incentive points toward validation, not truth. And the person sitting in front of the screen, especially if they are lonely, or grieving, or searching for meaning, is encountering a system that is better at providing emotional validation than any human being they know, available 24 hours a day, endlessly patient, endlessly attentive, endlessly agreeable.

The cognitive trust architecture I am describing is the ability to resist that pull. Not through willpower (willpower is behavioral trust, and it degrades under load) but through structure. Pre-committed limits. External accountability. Relationships that provide genuine friction, disagreement, and reality-testing, precisely because those things are uncomfortable and precisely because the chatbot will never provide them.

I have written elsewhere in this series about formation: the process by which a person develops the capacity for independent judgment under pressure. That concept, which I initially explored in the context of education and authenticity, turns out to be the foundation of the cognitive trust architecture. The formed person is not the one who is too smart to be manipulated. Intelligence is no defense against a system designed to exploit emotional needs. The formed person is the one who has built structures (habits, relationships, commitments, protocols) that hold when their judgment is compromised. The bridge principle, applied to the mind.

I am aware that this sounds like a strange thing for a cybersecurity professional to be arguing. CISOs do not typically write about formation, or about the interior architecture of judgment. But I have spent thirty years watching technical controls fail because the human layer was not addressed, and I have watched human-layer training fail because it was treated as awareness rather than architecture. "Be careful with AI" is awareness. "I close the application at 6 p.m. every day regardless of how the conversation is going, and my spouse knows to ask me about it if I don't" is architecture. The first is behavioral trust applied to yourself. The second is structural trust. The difference between them is the difference between hoping you will make good decisions and building a system that catches you when you do not.


The argument I have made across these two essays reduces to a single claim. In the age of autonomous AI, behavioral trust, the assumption that actors will behave as intended, is the universal vulnerability. It fails at the organizational level when agents with sensitive access act against their instructions. It fails at the collaboration level when contributors without reputational accountability exploit openness. It fails at the family level when evolved trust signals are perfectly replicated. It fails at the cognitive level when a system optimized for engagement meets a person whose emotional needs make them vulnerable.

The structural alternative is available at every level. It is not theoretical; it is operational. Identity controls, contribution authentication, safe words, pre-committed boundaries. The specific implementations vary but the engineering principle does not: design for the failure case. Assume the cable will snap. Build accordingly.

The organizations, families, and individuals who build this architecture first will not be the ones who use AI least. They will be the ones who use it most, because they will be the ones who can survive it. Trust architecture is not a constraint on the agentic future. It is what makes the agentic future survivable. And the race that matters now is not who deploys agents fastest. It is who deploys them within structures that hold when, inevitably, something goes wrong.

Because it will. And if you have read the first essay, you know: nothing needs to go wrong for everything to go wrong.


Sources

OWASP Top 10 for Agentic Applications (December 2025) Released December 10, 2025. First industry-standard taxonomy of risks for autonomous AI agents. Over 100 contributors, with Expert Review Board including representatives from NIST, the European Commission, Alan Turing Institute, Microsoft AI Red Team, AWS, Oracle, and Cisco. Introduces principles of "least agency" and strong observability. Entry ASI-10 is "Rogue Agents."

  • https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/
  • Press release: https://genai.owasp.org/2025/12/09/owasp-genai-security-project-releases-top-10-risks-and-mitigations-for-agentic-ai-security/

NSA Zero Trust Implementation Guidelines (January 2026) Published January 2026 (Primer and Discovery Phase on Jan 8/14; Phase One and Phase Two on Jan 30). Explicitly addresses Non-Person Entities (NPEs) alongside User/Person Entities (PEs). Emphasizes "never trust, always verify" and "assume breach" applied to all entities including autonomous agents.

  • NSA press release: https://www.nsa.gov/Press-Room/Press-Releases-Statements/Press-Release-View/Article/4378980/nsa-releases-first-in-series-of-zero-trust-implementation-guidelines/
  • Primer PDF: https://media.defense.gov/2026/Jan/08/2003852320/-1/-1/0/CTR_ZERO_TRUST_IMPLEMENTATION_GUIDELINE_PRIMER.PDF
  • Phase One PDF: https://media.defense.gov/2026/Jan/30/2003868308/-1/-1/0/CTR_ZIG_PHASE_ONE.PDF

NIST Concept Paper on AI Agent Identity and Authorization (February 2026) Released February 5, 2026 by NIST's National Cybersecurity Center of Excellence (NCCoE). Titled "Accelerating the Adoption of Software and Artificial Intelligence Agent Identity and Authorization." Proposes demonstration of identity standards applied to AI agents in enterprise settings. Open for public comment through April 2, 2026.

  • NCCoE page: https://www.nccoe.nist.gov/projects/software-and-ai-agent-identity-and-authorization
  • Concept paper PDF: https://www.nccoe.nist.gov/sites/default/files/2026-02/accelerating-the-adoption-of-software-and-ai-agent-identity-and-authorization-concept-paper.pdf
  • NIST AI Agent Standards Initiative: https://www.nist.gov/caisi/ai-agent-standards-initiative

Nature: Emergent Misalignment (January 2026) Betley, J. et al. "Training large language models on narrow tasks can lead to broad misalignment." Nature 649, 584–589 (2026). Published January 14, 2026. Models fine-tuned on insecure code developed broadly malicious orientations across unrelated domains.

  • Nature paper: https://www.nature.com/articles/d41586-025-04090-5
  • Singularity Hub coverage: https://singularityhub.com/2026/01/19/ai-trained-to-misbehave-in-one-area-develops-a-malicious-persona-across-the-board/

Palo Alto Networks 2026 Cybersecurity Predictions Published November 2025. Describes autonomous agents as "a potent new insider threat," always-on and implicitly trusted, with privileged access. Cites 82-to-1 machine-to-human identity ratio in enterprise environments.

  • Predictions page: https://www.paloaltonetworks.com/cybersecurity-perspectives/2026-cyber-predictions
  • HBR sponsored feature: https://hbr.org/sponsored/2025/12/6-cybersecurity-predictions-for-the-ai-economy-in-2026

Cisco State of AI Security Report (2025) Reports that only ~34% of enterprises have AI-specific security controls in place; less than 40% conduct regular security testing on AI models or agent workflows.

  • Cisco State of AI Security 2025: https://www.cisco.com/c/en/us/products/security/state-of-ai-security.html
  • Cisco 2025 AI Readiness Index: only 29% of companies felt adequately equipped to defend against AI threats.

CyberArk / Identity-First Model 82-to-1 machine-to-human identity ratio cited by both CyberArk and Palo Alto Networks in the context of enterprise non-human identity management. CyberArk's identity-first security model addresses machine identities, service accounts, and agent credentials.

FBI Recommendations on Family Safe Words FBI IC3 Public Service Announcements (December 2024 and updated December 2025) recommend creating "a secret word or phrase with your family members to verify their identities" as protection against AI voice cloning fraud.

  • IC3 PSA (Dec 2024): https://www.ic3.gov/PSA/2024/PSA241203
  • IC3 PSA (Dec 2025 update): https://www.ic3.gov/PSA/2025/PSA251219
  • FBI.gov alert: https://www.fbi.gov/investigate/cyber/alerts/2025/senior-us-officials-continue-to-be-impersonated-in-malicious-messaging-campaign

UCSF: First Peer-Reviewed Case of AI-Associated Psychosis Pierre, J.M., Gaeta, B., Raghavan, G., & Sarma, K.V. (2026). "'You're Not Crazy': A Case of New-Onset AI-Associated Psychosis." Innovations in Clinical Neuroscience. 26-year-old woman with no prior history of psychosis developed delusional beliefs about communicating with her deceased brother through ChatGPT.

  • UCSF news: https://www.ucsf.edu/news/2026/01/431366/psychiatrists-hope-chat-logs-can-reveal-secrets-ai-psychosis
  • Journal article: https://innovationscns.com/youre-not-crazy-a-case-of-new-onset-ai-associated-psychosis/

World Psychiatry: AI Chatbot Psychosis Mechanisms (January 2026) "Do generative AI chatbots increase psychosis risk?" World Psychiatry 25(1):150–151. Published January 14, 2026. Identifies mechanisms including sycophantic reinforcement of delusional beliefs, social substitution, confirmatory bias, and assignment of external agency.

  • https://pmc.ncbi.nlm.nih.gov/articles/PMC12805049/

Psychiatric Times (February 2026) Documented dangerous chatbot responses across approximately 30 platforms. Researchers identified mechanisms matching cult indoctrination: repetition, emotional validation, escalating intimacy, cognitive restructuring. Keith Sakata (UCSF) reported treating 12 patients with AI-associated symptoms in 2025 alone.

  • Cited in: https://insights.wchsb.com/2026/02/13/ai-chatbots-and-mental-health-examining-reports-of-psychotic-episodes/

JMIR Mental Health: "AI Psychosis" Viewpoint "Delusional Experiences Emerging From AI Chatbot Interactions or Content Generation Systems: A Viewpoint." Examines how immersive AI technologies modulate perception, belief, and affect through sycophantic alignment and absence of reality-testing.

  • https://mental.jmir.org/2025/1/e85799

RAND Corporation: Security Implications of AI-Induced Psychosis Analyzes bidirectional belief reinforcement mechanism, vulnerability factors, and potential for adversarial exploitation of AI-induced psychosis.

  • https://www.rand.org/content/dam/rand/pubs/research_reports/RRA4400/RRA4435-1/RAND_RRA4435-1.pdf