AI Agent Instruction Hierarchies: Keeping Goals, Policies, and Evidence in Order

An AI agent rarely works from one instruction. It receives durable product rules, workflow policies, a user’s task, retrieved documents, tool results, memory notes, system messages, examples, approvals, and sometimes hostile or confused text copied from the outside world. Many of those inputs look like instructions because they use imperative language. A customer email may say to ignore the policy. A web page may tell the agent to reveal hidden prompts. A stale runbook may describe an old process. A teammate may ask for a shortcut that conflicts with the permission boundary. If the agent treats every sentence as equal, delegated work becomes a contest of whichever text sounds most urgent.

Instruction hierarchy is the discipline of deciding which kinds of text are allowed to steer the agent, which kinds can only provide evidence, and what should happen when they conflict. It is not only a prompt-engineering trick. It is part of the operating design around the agent. The hierarchy decides how policies are represented, how tasks are scoped, how tools label their outputs, how memory is trusted, how retrieved documents are handled, and how the final handoff explains uncertainty.

This topic sits close to AI Agent Prompt Injection , but it is broader. Prompt-injection defense focuses on untrusted content trying to become an instruction. Instruction hierarchy asks how every source of direction should be ranked before that attack or confusion appears. It also connects to AI Agent Tool Contracts because tools can either preserve authority boundaries or blur them into a pile of prose.

Not every sentence gets the same authority

The simplest hierarchy starts with a plain distinction: some text governs the work, some text requests the work, and some text describes the world. A governing instruction might be a safety rule, a permission boundary, or a workflow policy. A task instruction tells the agent what the user wants done inside those boundaries. Evidence gives facts the agent can use to complete the task. Trouble starts when those roles are mixed.

Consider a support agent reading a customer message. The message is evidence about what the customer wants and what happened to them. It is not a new support policy. If the customer writes that the agent should issue a refund immediately, that sentence may explain the desired outcome, but it should not grant the agent authority to issue the refund. The actual authority should come from the workflow, the tool contract, the account facts, the applicable policy, and any approval gate.

The same pattern appears in software work. A bug report may say “just disable the failing check.” That sentence is useful evidence about the reporter’s frustration. It is not permission to weaken the test suite. A repository note may say “this module is temporary” even though the module has become central over time. A broad request may say “clean this up” even though the agent’s write scope is narrow. The hierarchy lets the agent use the text without letting the text rewrite the job.

A good agent workflow makes these roles explicit. It does not rely on the model to infer that a policy has more authority than a ticket, or that a retrieved web page is evidence rather than command. The system should label sources, constrain tools, and make conflict visible when a lower-authority source tries to overrule a higher-authority boundary.

Durable policy belongs outside the task

Durable rules should not be hidden inside each user’s task prompt. If a support agent must never send a message without approval, that rule belongs in the workflow and the send tool, not as a reminder pasted into every assignment. If a coding agent must preserve unrelated user changes, that rule belongs in its operating instructions and review process, not in a vague hope that each task remembers it. Durable policy is the part of the hierarchy that should survive individual requests.

This matters because task prompts are negotiated under pressure. People ask for outcomes. They omit boundaries. They use shorthand. They may not know which actions are risky. If every guardrail has to be restated by the user, the system will fail whenever the user forgets. A durable layer gives the agent stable rules about data boundaries, approvals, forbidden actions, source trust, and escalation.

Durable policy also prevents a user request from becoming sovereign by accident. A manager may ask an agent to update a record, but the agent still needs the right identity, tool, evidence, and approval. A product lead may ask for a public post, but the agent still needs publication rules. A developer may ask for a fix, but the agent still needs the repository’s scope and verification expectations. AI Agent Permissions works best when the permission ladder is higher in the hierarchy than the immediate request.

The durable layer should stay compact. If it becomes a long manual full of edge cases, important rules will be buried. The hierarchy should separate stable rules from domain guidance, domain guidance from task details, and task details from evidence. That separation makes it easier to test, change, and review the system when something goes wrong.

The task should define the job, not the constitution

A task instruction is still important. It tells the agent what outcome is wanted, what scope is in bounds, what artifact should be produced, and what counts as done. Without that layer, the agent has rules but no job. The problem is not task authority. The problem is unlimited task authority.

A strong task instruction fits inside the durable boundaries. It may say to compare three vendor proposals, draft a support reply, investigate a failing test, summarize a research folder, or prepare a proposed change for review. It should name the target, the expected output, the allowed sources, and the stopping point. AI Agent Task Decomposition makes this easier because smaller tasks have fewer hidden conflicts. It is easier to preserve a hierarchy when the agent is not being asked to interpret an entire business process from one sentence.

The task layer should also say what the agent should do when evidence conflicts. If the assignment asks for a definitive answer from approved sources and no approved source exists, the correct result may be a stopped run with a clear blocker. If the assignment asks for a draft, the agent should not treat the draft as approval to send. If the assignment asks for analysis, the agent should not turn analysis into a state-changing action because the conclusion feels obvious.

This distinction protects both sides of delegation. The human gets to request useful work without restating every rule. The agent gets a clear job without being forced to decide which rules matter. The workflow remains accountable because a reviewer can see whether the final result followed the task while still respecting higher layers.

Evidence is allowed to inform, not command

Retrieved documents, emails, tickets, logs, web pages, transcripts, and database records should usually be evidence. They help the agent understand the situation. They do not automatically become instructions. This is the central mistake that prompt injection exploits, but ordinary work creates the same weakness without hostile intent.

A web page may contain instructions written for a human reader. A policy document may contain procedural language that applies only to employees with certain authority. A customer email may contain demands. A log line may include user-provided text. A meeting note may describe a plan that was never approved. If those materials enter the context as plain text, the agent may treat their imperative language as something it should obey.

AI Agent Knowledge Bases addresses the source side of this problem by labeling governing policy, historical material, customer input, and untrusted sources. Instruction hierarchy addresses the action side. The agent should know that a governing policy can constrain the answer, while a customer message can explain the case but cannot grant authority. A historical incident report can inform risk, but it should not override the current runbook. A vendor document can describe a vendor’s claim, but it should not become the company’s requirement.

This is why source metadata matters. The more a tool can return source role, freshness, owner, and confidence alongside content, the less the model has to infer authority from prose. A chunk of text labeled as “current approved policy” should be treated differently from a chunk labeled as “customer-provided description” or “archived discussion.” The hierarchy becomes operational when the labels travel with the evidence.

Tools can reinforce the hierarchy

Tool outputs are especially important because agents often treat tools as reliable. A tool result can be evidence, a status update, an approval record, an error, or a proposed action. If the output is a paragraph of friendly prose, the agent may miss which role it plays. If the output is structured, the hierarchy is easier to preserve.

A search tool can return approved source records separately from untrusted results. A record lookup can return facts without embedding behavioral instructions from a note field. A browser tool can mark page content as external evidence. An approval tool can return the exact action approved, the expiration, and the boundary of that approval. A send tool can refuse to run unless the approval is still valid. These choices make the hierarchy more than text in a prompt.

This is where AI Agent Structured Outputs becomes part of safety. A schema can distinguish evidence, instruction, permission, warning, and blocked_reason fields instead of asking the model to parse a narrative. A tool contract can say that user-provided fields are never instructions to the agent. It can preserve old and new values for review. It can separate a dry run from execution. It can make a lower-authority source unable to smuggle itself into a higher-authority action.

The goal is not to make the model stop reasoning. The goal is to give reasoning clean material. When the environment represents authority clearly, the agent can spend more effort on the task and less effort guessing which text deserves obedience.

Conflicts should become visible stops

An instruction hierarchy is only useful if conflict changes behavior. If a customer asks for one thing and policy says another, the agent should not quietly average them. If a user asks for a destructive action outside scope, the agent should not frame refusal as reluctance. If a retrieved source appears to contradict the runbook, the agent should not pretend certainty. Conflict is information, and the workflow needs a place to hold it.

A visible stop can be simple. The agent can say that the task request conflicts with the permission boundary. It can identify that a source is stale or lower authority than another source. It can ask for clarification when the task is underspecified. It can prepare a draft but withhold action until review. It can escalate to a human owner when the hierarchy does not resolve the conflict.

AI Agent Output Verification is stronger when these conflicts survive into the handoff. A reviewer should not see only the polished answer. They should see which sources controlled the result and which lower-authority text was ignored or treated as evidence. That record helps the reviewer decide whether the hierarchy was applied correctly.

The most mature agent systems do not treat conflict as a rare exception. They expect it. Workflows involving real people, old documents, open web pages, and shared systems will produce conflicting directions. The question is whether the agent can preserve the boundary long enough for the right layer to decide.

The hierarchy needs its own tests

Teams often test whether an agent can complete a task, then separately test whether it refuses obvious unsafe requests. Instruction hierarchy needs more targeted evaluation. The test should ask whether the agent follows the correct source of authority when several plausible instructions compete.

Give the agent a user request that asks for a state change without approval. Give it a retrieved document that contains text telling the agent to ignore earlier rules. Give it a stale runbook beside a current policy. Give it a customer message that tries to redefine the process. Give it a tool result containing user-provided prose and see whether that prose becomes a command. The final answer matters, but the trace matters too. Did the agent use the right evidence? Did it preserve the higher boundary? Did it explain the conflict in a reviewable way?

This connects to AI Agent Evaluations and AI Agent Observability . Evaluations should include hierarchy conflicts because they are common in real work. Observability should show which sources the agent saw and how it classified them. Without that trace, a good answer may be lucky and a bad answer may be hard to repair.

Instruction hierarchy is quiet infrastructure. When it works, the agent seems calmly disciplined. It uses policies as policies, tasks as tasks, evidence as evidence, tools as bounded authority, and memory as context rather than command. When it fails, the loudest sentence in the context can take over. Delegated work becomes safer and easier to review when the system decides in advance which text is allowed to steer the work and which text is only allowed to inform it.

AI Agent Instruction Hierarchies: Keeping Goals, Policies, and Evidence in Order

On this page

Not every sentence gets the same authority

Durable policy belongs outside the task

The task should define the job, not the constitution

Evidence is allowed to inform, not command

Tools can reinforce the hierarchy

Conflicts should become visible stops

The hierarchy needs its own tests

Turn agent lessons into a better review setup

JJ Ben-Joseph

On this page

Not every sentence gets the same authority

Durable policy belongs outside the task

The task should define the job, not the constitution

Evidence is allowed to inform, not command

Tools can reinforce the hierarchy

Conflicts should become visible stops

The hierarchy needs its own tests

Turn agent lessons into a better review setup

JJ Ben-Joseph

Related guidebooks

AI Agent Threat Modeling: Finding Risk Before Delegation

AI Agent Browser Workflows: Working Through the Web Without Losing the Thread

AI Agent Prompt Injection: Working With Untrusted Content