AI agents loop because they have no native mechanism to detect that they are repeating themselves. The agent issues a tool call, receives an ambiguous result, issues the same tool call again, and continues indefinitely because nothing in the architecture tells it that cycling between the same two states is not progress. The loop runs until you kill the process or exhaust your API budget.
Analysis Briefing
- Topic: Infinite loop failure modes in LLM tool-use agents
- Analyst: Mike D (@MrComputerScience)
- Context: Originated from a live session with Claude Sonnet 4.6
- Source: Pithy Cyborg
- Key Question: Why does an agent keep retrying the same failing action instead of stopping?
Why LLM Agents Have No Native Loop Detection
LLM agents generate the next action based on their current context window. The context window contains the history of prior actions and their results. An agent that issued a tool call, received an ambiguous result, and is deciding what to do next is reasoning from that history.
The problem is that the agent’s decision to retry is locally reasonable at each step. The tool call returned an ambiguous result. Retrying a tool call that returned an ambiguous result is a sensible action. The agent takes it. The retry returns another ambiguous result. Retrying again is still locally reasonable. The agent takes it again.
Nothing in the agent’s context tells it that this is the fifteenth retry and that fifteen retries on the same ambiguous result constitutes a loop rather than persistent reasonable behavior. The agent has no global state representation that tracks retry count, cycle detection, or progress metrics. It reasons locally from each context state, and each local reasoning step concludes that another retry is warranted.
The Three Loop Patterns That Drain Budgets in Production
Same-tool retry loops are the first and simplest pattern. The agent calls a tool, the tool returns an error or ambiguous output, the agent calls the same tool again with the same or slightly modified parameters, and continues until the context window fills or the process is killed. This pattern appears when agents handle tool errors by retrying rather than by escalating or abandoning.
Oscillation loops are the second pattern, and they are harder to detect. The agent alternates between two actions: action A produces a state that makes action B seem correct, and action B produces a state that makes action A seem correct. The agent cycles between them indefinitely. Neither action is failing. Both are succeeding and producing outputs that are locally correct. The loop is visible only at the global level, where no progress is being made.
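Because each individual action succeeds, oscillation can only be caught by looking at the action history as a whole. A minimal sketch of such a check, assuming actions can be compared for equality (the function name and thresholds are illustrative, not from any specific framework):

```python
def detect_oscillation(action_history, window=6, max_cycles=2):
    """Flag A-B-A-B oscillation: a 2-action cycle repeating
    back-to-back in the most recent actions."""
    recent = action_history[-window:]
    if len(recent) < 2 * max_cycles:
        return False
    a, b = recent[-2], recent[-1]
    if a == b:
        return False  # same-action retries are a different pattern
    # Walk backwards counting consecutive (a, b) repetitions
    cycles = 0
    i = len(recent) - 2
    while i >= 0 and recent[i] == a and recent[i + 1] == b:
        cycles += 1
        i -= 2
    return cycles >= max_cycles
```

Running this against each proposed action before execution turns an invisible global pattern into a local, checkable condition.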
Re-planning loops are the third pattern and the most expensive in token terms. The agent fails at an execution step, decides to re-plan its approach, executes the new plan, fails again at a similar step, re-plans again, and continues. Each re-planning cycle generates a full new plan in the context window. The context fills rapidly. Token costs escalate. The agent is doing a great deal of work and making no progress.
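The straightforward defense against this pattern is a hard budget on re-plans: after a fixed number of plan rewrites without success, escalate instead of planning again. A sketch under assumed interfaces (`execute` and `replan` are caller-supplied callables; the names and exception are hypothetical):

```python
class ReplanBudgetExceeded(Exception):
    """Raised when re-planning repeats without progress."""

def run_with_replan_budget(plan, execute, replan, max_replans=2):
    """Execute a plan, allowing at most `max_replans` re-plans
    before escalating instead of looping."""
    replans = 0
    while True:
        ok, result = execute(plan)
        if ok:
            return result
        replans += 1
        if replans > max_replans:
            raise ReplanBudgetExceeded(
                f"{replans} failed attempts; escalating to handler")
        plan = replan(plan, result)  # one bounded rewrite, not unbounded
```

The key design choice is that the budget is enforced outside the agent, in the harness, so it holds regardless of how the model reasons.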
The Infrastructure Fixes That Prevent Loops Before They Drain Your Budget
Step count limits are the first and most important infrastructure fix. Every agent deployment should have a hard maximum on the number of actions the agent can take before the pipeline terminates and returns control to a human or a fallback handler. The limit should be set based on the maximum reasonable number of steps for the task, not on some large default. An agent that requires more than 20 steps to complete a task that normally takes 5 steps has looped, regardless of whether the loop is detectable from the action sequence.
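A step cap of this kind lives in the harness, not in the prompt. A minimal sketch, assuming the agent loop is expressed as a step function and a completion check (both placeholders for your own agent code):

```python
def run_agent(step_fn, is_done, max_steps=20):
    """Hard step cap: terminate and return control to a fallback
    handler instead of letting the agent run indefinitely.
    `max_steps` should be roughly 2x the expected step count."""
    state = {"history": []}
    for _ in range(max_steps):
        state = step_fn(state)          # one agent action + tool result
        if is_done(state):
            return ("done", state)
    # Budget exhausted: hand off rather than keep spending tokens
    return ("step_limit_reached", state)
```

Because the cap is enforced in ordinary control flow, it cannot be reasoned around by the model the way a prompt instruction can.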
Repeated action detection is the second fix. Logging the full action history and comparing each new proposed action against prior actions in the same session catches same-tool retry loops and oscillation loops before they run for many cycles. An agent that proposes an action identical or near-identical to one it already took should trigger a loop detection flag rather than executing.
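Comparing proposed actions against history requires a canonical fingerprint, so that a repeat is not masked by argument ordering. A sketch of one way to do this (the fingerprint scheme and threshold are assumptions, not a standard API):

```python
import json

def action_fingerprint(tool, params):
    """Canonical fingerprint: tool name plus sorted-key JSON,
    so {"a": 1, "b": 2} and {"b": 2, "a": 1} match."""
    return (tool, json.dumps(params, sort_keys=True))

def should_block(tool, params, history, max_repeats=3):
    """True if this exact action already appeared `max_repeats`
    times in the session's fingerprint history."""
    fp = action_fingerprint(tool, params)
    return history.count(fp) >= max_repeats
```

Near-identical (rather than identical) repeats need a fuzzier comparison, such as ignoring volatile parameters like timestamps, but exact-match detection alone catches the simplest and most common retry loops.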
Explicit progress metrics in the agent’s context are the third fix. Adding a structured progress representation to the agent’s context that it updates at each step, tracking what has been completed, what has been attempted and failed, and what remains, gives the agent the global state visibility that its architecture otherwise lacks. An agent that can see its own progress record is less likely to retry a failed action indefinitely because the failure is explicitly represented in its context rather than implicit in its action history.
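One way to sketch such a progress record, rendered as text and prepended to the agent's context each turn (the structure and wording are illustrative assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class ProgressRecord:
    """Explicit global state: completed, failed, and remaining
    work, updated by the harness after every step."""
    completed: list = field(default_factory=list)
    failed: list = field(default_factory=list)
    remaining: list = field(default_factory=list)

    def to_context(self):
        """Render for injection into the agent's context window."""
        return (
            "PROGRESS\n"
            f"completed: {self.completed}\n"
            f"failed (do not retry): {self.failed}\n"
            f"remaining: {self.remaining}"
        )
```

Marking failures as "do not retry" makes the loop-relevant fact explicit in the context rather than leaving it implicit in a long action history the model may not attend to.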
What This Means For You
- Set hard step count limits on every agent deployment before going to production. Define the maximum reasonable step count for your task and terminate at 2x that number. An agent that exceeds 2x the expected step count has looped, and no amount of additional steps will resolve a loop.
- Log and compare action sequences in real time. Flag and terminate any agent session where the same action appears more than three times in the action history. Same-action repetition is loop behavior regardless of whether individual steps are returning errors.
- Add explicit progress tracking to your agent’s context. Structure the context to include a completed actions list, a failed actions list, and a remaining tasks list. Agents with explicit progress records loop less frequently than agents whose only state representation is their raw action history.
- Build escalation handlers for ambiguous tool results. Rather than retrying on ambiguity, route ambiguous tool results to a clarification step, a human review gate, or a fallback action. Retry logic without a retry limit is loop infrastructure.
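The four recommendations above converge on the same control-flow shape: a bounded retry followed by escalation. A sketch of such a handler, assuming tool results arrive as a dict with a `status` key (a hypothetical result shape, not a specific library's API):

```python
def handle_tool_result(result, retries, max_retries=2):
    """Route a tool result instead of retrying forever:
    proceed on success, retry a bounded number of times on
    ambiguity, then escalate to clarification or human review."""
    status = result.get("status")
    if status == "ok":
        return "proceed"
    if status == "ambiguous" and retries < max_retries:
        return "retry"
    return "escalate"  # clarification step, review gate, or fallback
```

The retry limit is the whole point: without `max_retries`, this handler is exactly the loop infrastructure the bullet above warns about.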
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg → AI news made simple without hype.
