AI agent failures are not always dramatic. Sometimes the agent simply tries again. A search times out, so it searches again. A ticket update returns an ambiguous response, so it posts the note again. A queue worker restarts halfway through a task and repeats the last tool call. A human reviewer approves a prepared action, but the approval callback is delayed, so the orchestration layer resubmits the same request. In a read-only workflow, repetition is usually harmless. In a system that sends messages, changes records, moves files, opens pull requests, creates invoices, or modifies customer state, repetition becomes one of the ordinary ways delegated work creates mess.
That is why retries and idempotency belong in the design of agent workflows, not only in backend infrastructure. An idempotent action is one that can be requested more than once without multiplying the effect. The word sounds specialized, but the operational idea is simple: if the agent asks for the same real-world action twice, the system should recognize the repeat and return the same result instead of doing the action again. The agent may still be uncertain, the network may still fail, and the model may still choose poorly, but the surrounding system should not turn one intended action into three side effects.
This topic sits close to AI Agent Tool Contracts, because tools are where repeated actions become concrete. It also sits close to AI Agent Checkpoints, because a resumed run needs to know what has already happened. The narrow question here is what happens when the agent, the tool, the queue, or the human loop cannot tell whether the last step completed.
Retrying Is Not the Same as Repeating
A retry is an operational response to uncertainty. The agent or orchestrator expected a result and did not receive one in a usable form. The tool may have timed out. The model call may have been interrupted. A browser session may have crashed after submitting a form. A queue worker may have died after writing to one system and before writing the checkpoint. From the outside, the workflow sees an incomplete step. The dangerous assumption is that incomplete visibility means incomplete execution.
People make this mistake too, but agents make it easier to scale. A human who is unsure whether an email sent may open the sent folder before trying again. An agent may not have that habit unless the workflow gives it a way to check. A human may remember that a ticket was already updated a minute ago. A resumed agent may not see that memory unless the state is saved. A human may hesitate before clicking a payment button twice. A queue runner may only see a failed job and follow its retry policy.
The right question is not “Should agents retry?” They often must. Networks fail, tools are busy, and long-running work gets interrupted. The better question is which steps are safe to retry blindly, which steps need a lookup before retrying, and which steps should never be retried without a person or a recovery procedure. Reading a document, searching a knowledge base, rendering a preview, or running a local test can usually be repeated. Sending a message, creating a public record, charging an account, deleting a file, or merging a change should be treated as state-changing work that needs a stronger contract.
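One way to make the three retry classes explicit is a small lookup table that the orchestrator consults before replaying a step. This is a minimal sketch: the tool names and the class assigned to each one are illustrative assumptions, not a classification taken from any real system.

```python
from enum import Enum


class RetryPolicy(Enum):
    BLIND = "retry freely"                    # read-only or local work
    LOOKUP_FIRST = "check state, then retry"  # state-changing but inspectable
    HUMAN_ONLY = "stop and escalate"          # needs a person or a recovery procedure


# Hypothetical tool names, each mapped to an assumed retry class.
TOOL_POLICIES = {
    "search_knowledge_base": RetryPolicy.BLIND,
    "read_document": RetryPolicy.BLIND,
    "run_local_tests": RetryPolicy.BLIND,
    "send_message": RetryPolicy.LOOKUP_FIRST,
    "create_ticket": RetryPolicy.LOOKUP_FIRST,
    "charge_account": RetryPolicy.HUMAN_ONLY,
    "delete_file": RetryPolicy.HUMAN_ONLY,
}


def policy_for(tool_name: str) -> RetryPolicy:
    # Unlisted tools fall back to the most conservative class.
    return TOOL_POLICIES.get(tool_name, RetryPolicy.HUMAN_ONLY)
```

Defaulting unknown tools to the strictest class means a newly added tool cannot be replayed blindly just because nobody classified it yet.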
The Action Needs an Identity
Idempotency begins by giving the intended action an identity before the side effect happens. If an agent is about to send a customer reply, the workflow should not merely say “send this reply.” It should create a stable request identity tied to the ticket, the draft version, the recipient, the channel, and the approval that authorized it. If the same request arrives again, the sending tool can say, in effect, “this action was already completed under this identity,” and return the previous send record instead of producing a second message.
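The behavior described above can be sketched with an in-memory stand-in for a sending service. `SendTool`, `SendRecord`, and the `deliveries` counter are hypothetical names used only for illustration. A production tool would also need to record the identity and perform the send atomically, which this sketch does not attempt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SendRecord:
    message_id: str
    action_id: str


class SendTool:
    """In-memory stand-in for a sending service with duplicate detection."""

    def __init__(self) -> None:
        self._by_action: dict[str, SendRecord] = {}
        self.deliveries = 0  # counts real side effects, for illustration only

    def send(self, action_id: str, recipient: str, body: str) -> SendRecord:
        existing = self._by_action.get(action_id)
        if existing is not None:
            # Repeat request: return the earlier record and send nothing.
            return existing
        self.deliveries += 1  # the real send would happen here
        record = SendRecord(message_id=f"msg-{self.deliveries}", action_id=action_id)
        self._by_action[action_id] = record
        return record
```

Calling `send` twice with the same `action_id` returns the same `SendRecord` and leaves `deliveries` at 1.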
The identity does not have to be visible to the model as a cryptic technical detail, but it has to exist in the system. It might be an idempotency key, an action ID, a proposed change ID, a draft ID, or a workflow step ID. The important part is that the identity represents the action the human and the system meant to perform, not just the moment the agent happened to call a tool. A random key generated separately for every retry defeats the purpose. A key that is too broad can suppress distinct actions that should both happen. A key that is too narrow can let duplicates slip through.
Consider an agent that updates a project management ticket after reading a pull request. If the agent posts a comment saying that tests passed, the identity might bind together the ticket ID, the commit SHA, the test run ID, and the comment purpose. If the agent retries after a timeout, the tool can find the existing comment for that same purpose and commit. If a new commit arrives later, the identity changes because the action is genuinely different. This is not just backend neatness. It helps the agent’s own reasoning, because the tool can return a stable record instead of leaving the model to infer what happened from silence.
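A key for the ticket comment example could be derived by hashing the fields that define the action. The function name and field names below are assumptions made for this sketch. The point is that the key depends on what the action means, so a retry produces the same key and a new commit produces a different one.

```python
import hashlib
import json


def action_key(ticket_id: str, commit_sha: str, test_run_id: str, purpose: str) -> str:
    """Derive a stable key from what the action means, not from when it was called."""
    payload = json.dumps(
        {
            "ticket": ticket_id,
            "commit": commit_sha,
            "run": test_run_id,
            "purpose": purpose,
        },
        sort_keys=True,  # stable field order gives a stable digest
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Leaving `commit_sha` out of the payload would make the key too broad in the sense described earlier: a comment about a later commit would be suppressed as a duplicate.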
Prepare and Commit Are Different Moves
Many agent workflows become safer when they separate preparation from execution. The agent can draft a message, assemble a change, compute a patch, or propose a record update without immediately applying it. That prepared artifact receives an identity. A reviewer, policy check, or automated gate can inspect it. Only then does a narrower execution tool commit the artifact to the outside world.
This split is useful because most repeated work belongs on the preparation side. It is fine for an agent to regenerate a draft, re-run a comparison, or update a proposed patch before it is accepted. The side effect should occur once, at the commit boundary. If execution fails ambiguously, the workflow can ask whether the prepared artifact was committed rather than asking the model to reconstruct the whole task.
The pattern also clarifies approval. Human Review for AI Agents explains why review must be attached to inspectable work, not a vague sense that the agent is doing something reasonable. Idempotency adds another constraint: the approval should authorize a specific prepared artifact or a clearly bounded class of changes. If the agent changes the draft after approval, the commit identity should no longer match the approval. If the commit is retried, the execution tool should connect the retry to the same approved artifact rather than asking for a fresh approval merely because the network was noisy.
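Binding an approval to a specific artifact can be sketched by storing a digest of the exact draft the reviewer saw. `Approval`, `CommitTool`, and the returned status strings are hypothetical. The sketch assumes the draft is plain text and that a content hash is an acceptable stand-in for "the same artifact".

```python
import hashlib
from dataclasses import dataclass


def digest(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Approval:
    artifact_id: str
    content_digest: str  # digest of the exact draft the reviewer saw


class CommitTool:
    def __init__(self) -> None:
        self.committed: dict[str, str] = {}  # artifact_id -> committed digest
        self.side_effects = 0

    def commit(self, artifact_id: str, content: str, approval: Approval) -> str:
        current = digest(content)
        if approval.artifact_id != artifact_id or approval.content_digest != current:
            # The draft no longer matches what was approved.
            return "blocked: draft changed after approval"
        if self.committed.get(artifact_id) == current:
            # Retry of the same approved artifact: no new approval, no new effect.
            return "already committed"
        self.side_effects += 1  # the real commit would happen here
        self.committed[artifact_id] = current
        return "committed"
```

A retry after a noisy network call lands in the "already committed" branch, while an edited draft is blocked until someone approves the new version.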
Partial Failure Is the Normal Case to Design For
A simple success-or-failure story is rarely enough. A tool may update the primary record and fail to write the audit event. It may send the external message and fail before returning the message ID. It may create a file in storage and fail before linking it to the task. It may open a pull request and fail while fetching the URL. From the agent’s perspective, the result may look like an error. From the world’s perspective, the action may already have happened.
Good tools make partial state visible. A state-changing tool should be able to answer a follow-up question: given this action identity, what happened? The answer might be not started, prepared, committed, committed with a missing secondary record, failed before side effect, failed after side effect, or unknown and requiring human investigation. These states do not need to be exposed as a long taxonomy to every user, but the workflow needs more precision than a generic exception.
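The states listed above can be written down as an enumeration, with one helper that answers the question a retry decision usually turns on: did the side effect already reach the outside world? The names are taken from the list in the text. The grouping into `SIDE_EFFECT_ESCAPED` is this sketch's interpretation.

```python
from enum import Enum
from typing import Optional


class ActionState(Enum):
    NOT_STARTED = "not_started"
    PREPARED = "prepared"
    COMMITTED = "committed"
    COMMITTED_MISSING_SECONDARY = "committed_missing_secondary"
    FAILED_BEFORE_SIDE_EFFECT = "failed_before_side_effect"
    FAILED_AFTER_SIDE_EFFECT = "failed_after_side_effect"
    UNKNOWN = "unknown"


# States in which the side effect has already reached the outside world.
SIDE_EFFECT_ESCAPED = {
    ActionState.COMMITTED,
    ActionState.COMMITTED_MISSING_SECONDARY,
    ActionState.FAILED_AFTER_SIDE_EFFECT,
}


def side_effect_happened(state: ActionState) -> Optional[bool]:
    """Return True or False when known, None when a person has to investigate."""
    if state is ActionState.UNKNOWN:
        return None
    return state in SIDE_EFFECT_ESCAPED
```

Returning `None` for `UNKNOWN` instead of `False` keeps callers from reading "we cannot tell" as "it did not happen".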
This is where retries connect to AI Agent Observability. A trace is not only a debug log after something goes wrong. It is the live evidence the workflow uses to avoid making things worse. If the trace records the action ID, tool call, target object, returned artifact, and checkpoint status, a resumed agent has a chance to recover cleanly. If the trace only says “tool failed,” the next run may repeat the dangerous part because it cannot see that the side effect already escaped.
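A trace event carrying those five fields might look like the following. The `TraceEvent` shape and the `side_effect_recorded` helper are assumptions for illustration, not the format of any particular tracing system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceEvent:
    action_id: str
    tool: str
    target: str
    returned_artifact: Optional[str]  # e.g. a message or comment ID, if one came back
    checkpoint_written: bool


def side_effect_recorded(trace: list[TraceEvent], action_id: str) -> bool:
    """True if any event for this action shows an artifact from the outside world."""
    return any(
        e.action_id == action_id and e.returned_artifact is not None for e in trace
    )
```

An event with a returned artifact but `checkpoint_written=False` is exactly the case where a resumed run would otherwise repeat the step: the world changed, but the checkpoint does not say so.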
The Agent Should Check Before It Acts Again
A retry-safe workflow gives the agent a way to check state before repeating state-changing work. That check should be easier than improvising. If the agent posts comments, it should have a way to find its prior comments for the same task. If it creates tickets, it should be able to search by source event and action identity. If it edits a document, it should be able to inspect the current version and the previous patch. If it sends messages, it should be able to ask the sending service whether the prepared message was delivered.
The check must also be scoped. Asking the agent to search all history with natural language invites mistakes. A better tool can ask for the action identity and return the current known state. If the state is committed, the agent can continue from the result. If the state is prepared but not committed, it can decide whether the next gate is still valid. If the state is unknown, the workflow can stop and ask for help rather than treating uncertainty as permission to try again.
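The scoped check can be sketched as a function that takes an action identity, asks a lookup for the recorded state, and maps that state to a next step. The state strings, the returned step names, and the `lookup` callable are all hypothetical.

```python
from typing import Callable, Optional


def next_step(action_id: str, lookup: Callable[[str], Optional[str]]) -> str:
    """Decide what to do from the recorded state of one action identity."""
    state = lookup(action_id)  # scoped lookup by identity, not a free-text search
    if state == "committed":
        return "continue_from_result"
    if state == "prepared":
        return "revalidate_gate_then_commit"
    if state == "failed_before_side_effect":
        return "retry"
    # None, "unknown", or anything unrecognized: doubt is not permission to retry.
    return "stop_and_escalate"
```

For a quick experiment, `lookup` can be the `get` method of a dictionary that maps action IDs to state strings.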
This habit changes the tone of agent operations. The agent is no longer rewarded for smooth forward motion at all costs. It is allowed, and sometimes required, to pause when the system cannot prove what happened. That pause may feel conservative, but it is cheaper than duplicate customer messages, conflicting record updates, or a cleanup project caused by an eager retry loop.
Queues Need Boundaries, Not Just Backoff
Many retries are not chosen by the model. They are chosen by queues, schedulers, worker frameworks, browser automation layers, or orchestration code. Those systems often know how to retry with exponential backoff, but they do not automatically know which actions are safe to replay. Agent workflows therefore need to put semantic boundaries around queued steps.
A queue job that only says “continue the agent run” is hard to reason about. A queue job that says “commit prepared action A if it is still approved and not already committed” is much safer. The worker can check the action state, confirm the approval, perform the commit once, and write the result. If the worker crashes, the next worker repeats the same bounded commit request instead of re-running the whole agent conversation.
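A bounded commit job might be handled as below. `ActionStore`, `run_commit_job`, and the `perform` callback are illustrative names, and the store is a plain in-memory object standing in for whatever durable state a real worker would use.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ActionStore:
    approved: set[str] = field(default_factory=set)
    results: dict[str, str] = field(default_factory=dict)


def run_commit_job(
    action_id: str, store: ActionStore, perform: Callable[[str], str]
) -> str:
    """Handle one 'commit prepared action' job; intended to be safe to run again."""
    if action_id in store.results:
        return store.results[action_id]  # already committed: reuse the result
    if action_id not in store.approved:
        return "skipped: not approved"
    result = perform(action_id)          # the single real side effect
    store.results[action_id] = result    # record the result before acknowledging the job
    return result
```

The sketch still has a gap if the worker dies between `perform` and the write to `results`. That is why the downstream tool should also recognize `action_id` on its own, so the replayed job gets the earlier result back instead of a second side effect.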
This matters for cost and latency as well as safety. AI Agent Cost, Latency, and Queues focuses on the operating budget behind delegated work. Idempotency keeps that budget from being burned on useless repeats. A workflow that can resume from a named step does not need to ask the model to re-read every source. A commit tool that can return the previous result does not need to call an external service again. A queue that knows a step is already complete can move on instead of making the agent rediscover its own history.
Duplicate Prevention Should Be Visible to Reviewers
Idempotency can fail quietly if it is treated as hidden plumbing. Reviewers need to see enough of the duplicate-prevention story to trust the action. They do not need every internal key, but they should be able to tell whether the action is new, already completed, superseded, or blocked because the current state does not match the approved state.
For example, a reviewer approving an agent-prepared customer reply should see the draft version and the target ticket. After the send, the review surface should show the delivery record. If the agent or worker retries, the system should show that the retry returned the existing delivery record rather than sending again. This small piece of evidence prevents a familiar review failure: the human sees a polished final summary but cannot tell whether the workflow actually controlled the side effect.
The same principle applies to engineering work. If an agent opens a pull request, subsequent retries should not open a chain of duplicate pull requests with similar titles. The workflow should preserve the branch, pull request number, commit state, and review context. If the agent needs to update the work after feedback, that is a new edit to the same artifact, not a replay of creation. The reviewer should see continuity rather than a trail of near-duplicates.
Recovery Is Part of the Contract
Even careful idempotency design will not remove every messy case. A third-party system may lack reliable duplicate detection. A browser may submit a form with no stable transaction ID. A legacy tool may return success before all downstream effects are visible. Some actions are not meaningfully reversible. The goal is not to pretend the system can make every repeated action harmless. The goal is to know where repetition is safe, where it is guarded, and where it must stop.
That recovery plan belongs beside the tool contract. A tool should say what identifier it uses for duplicate detection, what result it returns on repeat, what state it can inspect after an ambiguous failure, and what humans should do when the state cannot be determined. For high-impact actions, the contract should also name the rollback or compensation path. AI Agent Incident Response becomes much easier when every risky tool already leaves behind enough evidence to answer the first operational question: did the action happen once, more than once, or not at all?
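The contract items named above can be captured as a small record with a check for missing pieces. `ToolContract` and its field names are assumptions for this sketch. Treating a missing compensation path as a gap only for high-impact tools follows the distinction drawn in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolContract:
    name: str
    dedup_identifier: str                # what the tool uses to detect duplicates
    on_repeat: str                       # what it returns when the same request repeats
    inspectable_states: tuple[str, ...]  # states it can report after an ambiguous failure
    when_undetermined: str               # what humans should do if state is unknowable
    high_impact: bool = False
    compensation: Optional[str] = None   # rollback or compensation path

    def gaps(self) -> list[str]:
        """List the contract items that are still missing."""
        missing = [
            label
            for label, value in [
                ("dedup_identifier", self.dedup_identifier),
                ("on_repeat", self.on_repeat),
                ("inspectable_states", self.inspectable_states),
                ("when_undetermined", self.when_undetermined),
            ]
            if not value
        ]
        if self.high_impact and not self.compensation:
            missing.append("compensation")
        return missing
```

A check like this could run when a tool is registered, so that an incomplete contract is caught before the tool is exposed to an agent.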
The mature version of this design is not dramatic. It is a collection of ordinary habits. Give every meaningful action an identity. Separate preparation from execution. Bind approvals to artifacts. Return stable records from state-changing tools. Check action state before retrying. Make partial failure visible. Let queues repeat bounded steps instead of whole conversations. Leave enough evidence that a reviewer can tell the difference between a new action and a repeated request.
AI agents will keep encountering uncertainty. They will lose connections, hit tool errors, resume from checkpoints, and work through queues. The question is whether uncertainty turns into repeated side effects. Idempotency is one of the quiet engineering disciplines that keeps delegated work from becoming fragile. It does not make the agent wise. It makes the world around the agent less willing to do the same risky thing twice.



