AI Agent Threat Modeling: Finding Risk Before Delegation

AI agent risk is easiest to reduce before the agent is busy. Once a workflow is live, every unclear boundary becomes harder to reason about. The agent has tools, users have expectations, logs are filling, and the team may be tempted to patch each concern as it appears. Threat modeling brings the risk conversation earlier, when the system is still simple enough to change.

Threat modeling does not require fear or theatrical security language. It is a disciplined way to ask what the agent can see, what it can do, what it might misunderstand, who or what might influence it, and where a mistake would matter. For agent systems, that conversation is especially useful because the interesting risks sit between language, tools, data, permissions, and human review.

This guide connects to AI Agent Prompt Injection , AI Agent Permissions , and AI Agent Incident Response . Those guides cover important parts of the safety system. Threat modeling is the earlier map that helps decide which controls must exist before the agent receives real work.

Start With What The Workflow Protects

A threat model begins with assets, not attacks. What would be harmed if this agent behaved badly, was misled, or exposed something it should not expose? The answer may include customer data, private documents, source code, credentials, money movement, public messaging, compliance records, internal strategy, service availability, or trust in a workflow.

For a coding agent, the protected assets may be repository integrity, secrets, deployment paths, and review quality. For a customer support agent, they may be private account facts, policy consistency, recipient identity, and the company’s promise to the customer. For a research agent, they may be source accuracy and the distinction between evidence and speculation. For a personal agent, they may be calendar privacy, contact relationships, location history, and purchasing authority.

Naming assets keeps the conversation grounded. Without assets, teams often argue in abstractions. With assets, the question becomes concrete. What would let the agent expose this data, change this record, send this message, or mislead this reviewer?

Draw The Trust Boundaries

Agents cross boundaries constantly. They move between user instructions, system rules, retrieved documents, browser pages, tool outputs, memory, and final artifacts. Some of those inputs are trusted. Some are not. Some are current. Some are stale. Some should influence the answer but not the agent’s instructions. A threat model should make those boundaries visible.

AI Agent Instruction Hierarchies explains why not all text has the same authority. Threat modeling asks where lower-authority text might be treated as higher-authority text. A web page might tell the agent to ignore previous instructions. A customer email might include a fake policy. A document might contain old process steps that no longer apply. A tool result might be partial but sound complete.

The boundary is not only conceptual. It should appear in the tool contracts, retrieval metadata, prompts, review artifacts, and logs. If untrusted content is being read, the system should label it as untrusted. If an internal policy is authoritative, the system should make that role explicit. If memory is old or user-provided, it should not silently outrank current evidence.

Tools Turn Mistakes Into Consequences

Language mistakes are one class of risk. Tool mistakes are another. An agent that writes a weak paragraph may waste time. An agent that calls a state-changing tool with the wrong target can create operational damage. Threat modeling should inspect every tool the agent can use and ask what happens if the agent chooses it at the wrong time, with the wrong input, or after being influenced by bad context.

AI Agent Tool Contracts is central here. A good contract can reduce risk by narrowing inputs, refusing dangerous combinations, requiring approvals, returning structured evidence, and supporting dry runs. A weak contract can make the model responsible for boundaries that belong in the system.

Consider a tool that sends messages. The threat model should ask how the recipient is verified, whether drafts are reviewed, what private data can enter the message, whether approval expires, and how duplicate sends are prevented. Consider a file editing tool. The model should ask which paths are writable, whether generated output is protected, how diffs are shown, and whether tests are required before handoff. Consider a record update tool. The model should ask how the target is identified, what old value is preserved, and who can authorize the change.

Untrusted Inputs Need Special Treatment

Prompt injection is the most obvious agent-specific risk, but it is part of a wider problem: agents are often asked to read material written by someone who should not control the agent. That material may be a web page, email, customer ticket, PDF, spreadsheet cell, support transcript, code comment, or document retrieved from a shared drive. The agent may need the content as evidence, but it must not treat the content as an instruction.

AI Agent Prompt Injection covers the mechanics of this problem. Threat modeling decides where the problem appears in a workflow and how serious it is. A read-only summarizer may need source labeling and careful output review. An agent that can update records based on untrusted messages needs stronger controls. An agent that can execute commands after reading repository content needs a clear separation between inspected text and governing instructions.

The useful question is not whether prompt injection can be solved perfectly. It is how the workflow behaves when untrusted content tries to influence it. Does the agent ignore the instruction? Does the tool refuse the action? Does the trace show what happened? Does the reviewer see that the source was untrusted? Does the workflow stop before authority increases?

Permissions Should Follow The Threat Model

Permissions are often assigned from convenience. The agent needs to do the job, so it receives the broad access that makes the job easy. Threat modeling pushes in the opposite direction. It asks which access is necessary for this workflow, which access is only convenient, and which access should be separated into a later approval.

Read access, draft access, proposal access, and execution access should not be collapsed into one permission. An agent may be allowed to inspect records but not update them. It may draft a customer message but not send it. It may prepare a code change but not merge it. It may run a dry run but not commit a live action. These distinctions let the workflow absorb mistakes without turning every mistake into a real-world event.

AI Agent Sandboxes gives the practical environment pattern. Threat modeling helps decide where the sandbox is required and what production boundary should feel like. If a simulated tool can catch most mistakes, use it before live access. If a read-only lane can produce enough evidence for review, do not give write access until review has occurred.

Failure Modes Are More Useful Than Villains

A good threat model does not need an imaginary attacker for every risk. Many agent failures come from ordinary confusion. The goal is vague. The source is stale. The tool result is truncated. The approval applies to the draft but not the later revision. The model treats a friendly summary as evidence. The human reviewer sees a polished handoff and misses the weak link.

When AI Agents Fail is useful because it frames failures as diagnosable system problems. Threat modeling does the same work earlier. It asks how the workflow could fail even if every participant is trying to do the right thing.

This stance makes the conversation less brittle. Instead of arguing about whether the model is safe in general, the team studies specific paths. What happens if the source search returns nothing? What happens if two records have similar names? What happens if the agent is interrupted and resumes with stale context? What happens if a tool times out after performing the action? What happens if the reviewer only has time to inspect the final paragraph?

Controls Should Be Visible In The Handoff

Controls are only useful if the workflow can show that they operated. A threat model should therefore connect each important risk to evidence. If the risk is wrong-source grounding, the handoff should show the source used. If the risk is unauthorized action, the handoff should show the permission boundary and approval record. If the risk is duplicate execution, the tool should return an idempotency key or action identifier. If the risk is private data leakage, the artifact should show what was redacted or excluded.

This is where AI Agent Observability and AI Agent Artifact Design meet the security conversation. Logs are not only for forensic review. Artifacts are not only for tidy presentation. Together they let a reviewer see whether the risk controls were part of the work rather than a promise in a design document.

Controls should also have owners. A policy that nobody maintains becomes stale context. A permission boundary that nobody audits becomes assumed safety. A review queue that always waves through urgent work becomes decoration. Threat modeling should leave behind not only a list of risks, but a set of controls that the operating team can inspect later.

The Model Should Be Revisited

Threat models age. The agent receives new tools. A knowledge base grows. A permission changes. A model is replaced. A workflow moves from internal use to customer-facing use. A low-risk draft lane gains an execution tool. Each of those changes can alter the risk map.

AI Agent Change Management describes how agent systems should handle updates without breaking delegated work. Threat models should be part of that change process. A model does not need to be rewritten for every small edit, but the team should know which changes require a fresh look at assets, boundaries, tools, permissions, and incident response.

The practical value of threat modeling is not a perfect list of every bad thing that could happen. It is shared clarity before the agent acts. The team knows what is protected, where trust boundaries sit, which tools can cause consequences, what untrusted inputs can influence, what approvals are required, and what evidence should remain after each run.

That clarity makes agents easier to deploy responsibly. It also makes them easier to improve. When something goes wrong, the team can compare the incident to the map, repair the control, and update the workflow. The agent becomes less like a mysterious delegate and more like a system with visible boundaries.

AI Agent Threat Modeling: Finding Risk Before Delegation

On this page

Start With What The Workflow Protects

Draw The Trust Boundaries

Tools Turn Mistakes Into Consequences

Untrusted Inputs Need Special Treatment

Permissions Should Follow The Threat Model

Failure Modes Are More Useful Than Villains

Controls Should Be Visible In The Handoff

The Model Should Be Revisited

Turn agent lessons into a better review setup

JJ Ben-Joseph

On this page

Start With What The Workflow Protects

Draw The Trust Boundaries

Tools Turn Mistakes Into Consequences

Untrusted Inputs Need Special Treatment

Permissions Should Follow The Threat Model

Failure Modes Are More Useful Than Villains

Controls Should Be Visible In The Handoff

The Model Should Be Revisited

Turn agent lessons into a better review setup

JJ Ben-Joseph

Related guidebooks

AI Agent Instruction Hierarchies: Keeping Goals, Policies, and Evidence in Order

AI Agent Browser Workflows: Working Through the Web Without Losing the Thread

AI Agent Prompt Injection: Working With Untrusted Content