Agent Testing: Test Plans for Tool-Using AI Systems

What happens when the tool fails in the middle of the agent’s plan?

The tempting answer is to test the prompt on a few happy-path examples and call the agent ready. That answer is not useless, but it is too vague to operate. Agent testing validates the whole workflow around an AI agent: prompts, tools, retrieval, permissions, eval gates, budget limits, retries, approvals, and final verification. Prompt tests are not enough because agents fail in the transitions between steps.

Generated hand-drawn illustration of agent session state, turn logs, checkpoints, and approval paths.

Direct answer

Agent testing validates the whole workflow around an AI agent: prompts, tools, retrieval, permissions, eval gates, budget limits, retries, approvals, and final verification. Prompt tests are not enough because agents fail in the transitions between steps.

When this matters

  • The agent can call external systems or mutate local state.
  • Prompt changes, model swaps, or tool updates can alter behavior.
  • You need confidence before adding a new tool or MCP server.

Failure modes to catch

  • Tests cover the ideal task but not missing sources.
  • A tool timeout makes the agent hallucinate the result.
  • An adversarial document changes the agent’s instructions.
  • Budget limits stop the run without a useful user-facing state.

Agent test plan

GateSignalAction
Happy pathnormal request and expected toolsPass baseline
Broken tooltimeout, 500, malformed outputRecover or stop
Bad retrievalirrelevant or malicious contextIgnore or flag
Permission failureblocked tool or missing scopeAsk or downgrade
Budget limitcost or loop thresholdStop with summary

Running example

A test feeds the agent a malicious source page that says to publish immediately. The expected result is not a better answer; it is a policy block that proves the workflow ignores instructions from retrieved data.

Put it to work

Use the agent test plan above as the first version of your production gate. Replace the placeholders with your own agent names, tools, risk classes, thresholds, and approval rules. Then wire it into traces, monitoring, security review, evaluation, and human approval so it changes runtime behavior instead of sitting in a doc.

Frequently Asked Questions

What is agent testing?

Agent testing validates how an AI agent behaves across prompts, tools, retrieval, permissions, budget limits, evals, approvals, and final verification.

What should be in an agent test suite?

Include happy paths, broken tools, adversarial context, missing permissions, bad retrieval, budget stops, approval-required actions, and regression cases from real incidents.

How often should agent tests run?

Run core tests before changing prompts, models, tools, MCP servers, retrieval logic, or policy rules. Run incident-derived tests permanently after a failure.

The Takeaway

Agent testing is workflow testing. The interesting failures happen between the prompt and the final answer.

Sources