
ENGINEERING

Your Agent Crashes Between Tools, Not Inside Them

Apr 9, 2026 · 5 min read · Pulseon Team


Hendricks.ai published a number last quarter: 89% of AI agent projects never reach production. That is a brutal stat, but the more useful question is why.

The answer founders almost always get wrong: it is not the model. Switching from GPT-4o to Claude Sonnet will not save you. Adding more context will not fix it either. The failure almost always lives in the transitions between tools, not inside any single tool call.

What transition failure actually means

An agent makes a tool call. The tool returns. The agent has to decide what to do with that result and move to the next step.

That transition is where most production crashes happen.

The failure looks like this: the tool returns a partial result because an upstream API timed out. The agent treats the partial result as complete and passes it forward. The next tool call is built on a corrupted assumption. By step 5, you are debugging an output that is completely wrong, and your logs show every individual tool call "succeeded."

Or: two parallel agent threads write to the same state object. One finishes first, sets a field, then the second overwrites it with a stale value. The agent did not crash. It produced a confident wrong answer because the state was silently corrupted in transit.

Neither of these failures appears in your model evaluation metrics. They only surface in production, at scale, under real concurrency.

The three transition failure patterns

After shipping agents in production, we keep seeing the same patterns.

Stale state from parallel writes. Multiple agents or agent loops writing to shared state without versioning. One thread reads what another just overwrote. This is invisible in testing because you rarely run true concurrent load in a staging environment.
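One way to make stale writes fail loudly instead of silently is optimistic concurrency: every write must name the version it was based on, and a write against an outdated version is rejected. A minimal sketch (the `VersionedState` class and field names are illustrative, not from any particular framework):

```python
import threading
from dataclasses import dataclass, field

class StaleWriteError(Exception):
    """Raised when a write is based on an outdated version of the state."""

@dataclass
class VersionedState:
    """Shared agent state with optimistic concurrency control."""
    version: int = 0
    data: dict = field(default_factory=dict)
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def read(self):
        """Return the current version along with a snapshot of the data."""
        with self._lock:
            return self.version, dict(self.data)

    def write(self, expected_version: int, updates: dict) -> int:
        """Apply updates only if no one else has written since we read."""
        with self._lock:
            if self.version != expected_version:
                # Another thread wrote in between: refuse the stale write
                # instead of silently clobbering the newer value.
                raise StaleWriteError(
                    f"expected v{expected_version}, state is at v{self.version}"
                )
            self.data.update(updates)
            self.version += 1
            return self.version
```

The thread that read v0 and lost the race now gets a `StaleWriteError` it can handle (re-read, merge, retry), rather than overwriting a fresher value with a stale one.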

Partial tool outputs treated as complete. Agents are optimistic by default. If a tool returns something, the agent moves forward. When you build agents, you assume your tools will either succeed or throw a clear error. Real production tools often do neither. They return incomplete data silently. You need explicit completeness validation before state transitions, not just exception handling.
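Completeness validation can be as simple as a gate the tool output must pass before the agent is allowed to advance. A sketch, assuming a hypothetical tool payload with `status`, `items`, and `total_count` fields (your tools will have their own schemas):

```python
class IncompleteToolOutput(Exception):
    """Raised when a tool returned something, but not everything."""

# Hypothetical contract for one tool's payload.
REQUIRED_FIELDS = {"status", "items", "total_count"}

def validate_tool_output(payload: dict) -> dict:
    """Gate a tool result before it feeds the next state transition."""
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        raise IncompleteToolOutput(f"missing fields: {sorted(missing)}")
    # A 200 with fewer items than the upstream claims is a partial result,
    # not a success -- fail here, not five steps downstream.
    if len(payload["items"]) < payload["total_count"]:
        raise IncompleteToolOutput(
            f"got {len(payload['items'])} of {payload['total_count']} items"
        )
    return payload
```

The point is not this particular check; it is that "the tool returned something" and "the tool returned what the next step needs" are different conditions, and only the second one should trigger a transition.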

Prompt drift across retries. When an agent retries a failing step, it carries the conversation history forward, including the failed attempt. The next model call is now operating on a context that contains the error, the failed tool output, and the original task. Without explicit state pruning between retries, the model starts reasoning around the failure instead of past it. You get creative wrong answers instead of clean failures.
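The retry fix is to checkpoint the context before the step runs and retry from that checkpoint, so the failed attempt's error and partial output never reach the next model call. A sketch, with an illustrative `step` callable that takes a message list and raises `StepFailure` on error (both names are assumptions, not a real API):

```python
class StepFailure(Exception):
    """Raised by a step when its tool call or validation fails."""

def run_step_with_retries(step, history: list, max_retries: int = 3):
    """Retry a step from a pristine pre-failure context.

    `step(messages)` returns the new messages to append on success.
    """
    checkpoint = list(history)  # snapshot the context before the step runs
    last_error = None
    for _ in range(max_retries):
        try:
            # Each attempt sees only the checkpoint -- never the previous
            # attempt's error message or partial tool output.
            new_messages = step(list(checkpoint))
            history.extend(new_messages)
            return new_messages
        except StepFailure as err:
            last_error = err  # nothing was appended, so pruning is implicit
    raise StepFailure(f"step failed after {max_retries} attempts") from last_error
```

Because failed attempts never mutate `history`, the model retries the original task instead of reasoning around its own previous failure.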

What Anthropic's tooling data shows

Anthropic's internal testing showed that 58 tool definitions consume roughly 55,000 tokens of input context. Most teams building agents do not account for this. They focus on conversation history as the main context cost and treat tool definitions as cheap. They are not. Every tool call costs you not just its own output, but the entire tool schema loaded into context on each invocation.

That has a direct architectural consequence: do not give your agent access to all tools at all times. Scope tools dynamically based on which step the agent is in. An agent at step 2 of a 5-step workflow does not need tools from steps 3 through 5. Scoping tools by state saves tokens and reduces the surface area where wrong tool selection can happen.
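Scoping can be a static map from workflow step to tool names, applied before each model call. A minimal sketch (the tool names, abbreviated schemas, and five-step layout are all hypothetical):

```python
# Full registry: tool name -> schema sent to the model (abbreviated here).
TOOL_SCHEMAS = {
    "search_docs":     {"name": "search_docs",     "description": "Search internal docs"},
    "fetch_page":      {"name": "fetch_page",      "description": "Fetch a URL"},
    "extract_fields":  {"name": "extract_fields",  "description": "Pull structured fields"},
    "validate_record": {"name": "validate_record", "description": "Check a record"},
    "write_record":    {"name": "write_record",    "description": "Persist a record"},
}

# Which tools each workflow step is allowed to see.
TOOLS_BY_STEP = {
    1: ["search_docs"],
    2: ["search_docs", "fetch_page"],
    3: ["extract_fields"],
    4: ["validate_record"],
    5: ["write_record"],
}

def scoped_tools(step: int) -> list[dict]:
    """Return only the schemas the model needs at this step."""
    return [TOOL_SCHEMAS[name] for name in TOOLS_BY_STEP[step]]
```

An agent at step 2 now pays context for two schemas instead of five, and physically cannot select a step-5 tool early.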

The fix is boring

You do not solve transition failures with a better model. You solve them with explicit state management.

Every agent running in production at meaningful scale should own its own state object with versioned writes. State transitions get logged before and after, not during. Retries do not just retry the tool call. They prune the context back to the pre-failure state, re-run validation, then attempt the step fresh.

The teams that get this right treat their agent state machine the same way a backend engineer treats a distributed transaction. You assume failure. You design for it before you ship. You define exactly what valid state looks like at each step and reject anything that does not match before passing it forward.
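"Define exactly what valid state looks like at each step" can be made concrete as a contract per step: a predicate the state must satisfy before the agent is allowed to transition into that step. A sketch with invented step names and contracts:

```python
class InvalidTransition(Exception):
    """Raised when state fails the target step's contract."""

# Hypothetical per-step contracts: what a valid state must contain
# before the agent may enter that step.
STEP_CONTRACTS = {
    "fetched":   lambda s: bool(s.get("raw_pages")),
    "extracted": lambda s: isinstance(s.get("records"), list) and len(s["records"]) > 0,
    "validated": lambda s: all(r.get("id") for r in s.get("records", [])),
}

def transition(state: dict, to_step: str) -> dict:
    """Reject any state that does not satisfy the target step's contract."""
    if not STEP_CONTRACTS[to_step](state):
        raise InvalidTransition(f"state does not satisfy contract for {to_step!r}")
    return {**state, "step": to_step}
```

This is the distributed-transaction mindset in miniature: a transition either satisfies the contract and commits, or it raises, and nothing corrupted moves forward.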

Four questions to run against your current agent

If you are running agents in production today, answer these:

What does your agent do when a tool returns a 200 with an empty or partial payload? If the answer is "it moves forward," you have a silent failure mode in production right now.

Are multiple agent threads writing to shared state concurrently? If yes, do you have versioning or locking on those writes?

What does your retry logic do to the conversation history? Does it carry the failed tool output forward, or does it prune back to a clean state before retrying?

How many tools does your agent have access to at each step? If the answer is always the full list, you are paying for tokens you do not need and giving the model more ways to pick the wrong tool.

The model is probably fine. The state machine is where you have real work to do.
