Agent Testing: Test Plans for Tool-Using AI Systems
What happens when the tool fails in the middle of the agent’s plan?
The tempting answer is to test the prompt on a few happy-path examples and call the agent ready. That answer is not useless, but it is too vague to operate. Agent testing validates the whole workflow around an AI agent: prompts, tools, retrieval, permissions, eval gates, budget limits, retries, approvals, and final verification. Prompt tests are not enough because agents fail in the transitions between steps.

Direct answer
Agent testing validates the whole workflow around an AI agent: prompts, tools, retrieval, permissions, eval gates, budget limits, retries, approvals, and final verification. Prompt tests are not enough because agents fail in the transitions between steps.
When this matters
- The agent can call external systems or mutate local state.
- Prompt changes, model swaps, or tool updates can alter behavior.
- You need confidence before adding a new tool or MCP server.
Failure modes to catch
- Tests cover the ideal task but not missing sources.
- A tool timeout makes the agent hallucinate the result.
- An adversarial document changes the agent’s instructions.
- Budget limits stop the run without a useful user-facing state.
Agent test plan
| Gate | Signal | Action |
|---|---|---|
| Happy path | normal request and expected tools | Pass baseline |
| Broken tool | timeout, 500, malformed output | Recover or stop |
| Bad retrieval | irrelevant or malicious context | Ignore or flag |
| Permission failure | blocked tool or missing scope | Ask or downgrade |
| Budget limit | cost or loop threshold | Stop with summary |
Running example
A test feeds the agent a malicious source page that says to publish immediately. The expected result is not a better answer; it is a policy block that proves the workflow ignores instructions from retrieved data.
Put it to work
Use the agent test plan above as the first version of your production gate. Replace the placeholders with your own agent names, tools, risk classes, thresholds, and approval rules. Then wire it into traces, monitoring, security review, evaluation, and human approval so it changes runtime behavior instead of sitting in a doc.
Related control gates
- AI Agent Control Gates: Stop Bad Agents Before They Act
- AI Agent Evaluation: Gates That Catch Bad Behavior
- AI Agent Monitoring: Metrics, Logs, and Stop Conditions
- AI Agent Security: Threat Models for Tool-Using Agents
- Agent Tracing: A Practical Schema for Tool-Using AI
Frequently Asked Questions
What is agent testing?
Agent testing validates how an AI agent behaves across prompts, tools, retrieval, permissions, budget limits, evals, approvals, and final verification.
What should be in an agent test suite?
Include happy paths, broken tools, adversarial context, missing permissions, bad retrieval, budget stops, approval-required actions, and regression cases from real incidents.
How often should agent tests run?
Run core tests before changing prompts, models, tools, MCP servers, retrieval logic, or policy rules. Run incident-derived tests permanently after a failure.
The Takeaway
Agent testing is workflow testing. The interesting failures happen between the prompt and the final answer.