AI Agent Evaluation Metrics
AI agent evaluation metrics are purpose-built measurements that assess how well autonomous LLM systems reason, plan, execute tools, and complete tasks. Unlike traditional LLM metrics that evaluate single input-output pairs, AI agent evaluation metrics analyze the entire execution trace—capturing every reasoning step, tool call, and intermediate decision your agent makes.
These metrics matter because AI agents fail in fundamentally different ways than simple LLM applications. An agent might select the right tool but pass wrong arguments. It might create a brilliant plan but fail to follow it. It might complete the task but waste resources on redundant steps. AI agent evaluation metrics give you the granularity to pinpoint exactly where things go wrong.
For a broader overview of AI agent evaluation concepts and strategies, see the AI Agent Evaluation guide.
AI agent evaluation metrics in deepeval operate on execution traces—the full record of your agent's reasoning and actions. This requires setting up tracing to capture your agent's behavior.
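As a minimal, self-contained sketch (using the same @observe decorator that appears throughout this page), tracing is enabled by decorating each component of your agent so its spans end up in the execution trace:

```python
from deepeval.tracing import observe

@observe(type="tool")
def get_weather(city):
    # Tool span: recorded as part of the execution trace
    return {"temp": "22°C", "condition": "sunny"}

@observe(type="agent")
def weather_agent(user_input):
    # Agent span: wraps the full run so trace-based metrics can evaluate it
    result = get_weather("Paris")
    return f"The weather is {result['condition']}, {result['temp']}"
```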
The Three Layers of AI Agent Evaluation
AI agents consist of interconnected layers that each require distinct evaluation approaches:
| Layer | What It Does | Key Metrics |
|---|---|---|
| Reasoning Layer | Plans tasks, creates strategies, decides what to do | PlanQualityMetric, PlanAdherenceMetric |
| Action Layer | Selects tools, generates arguments, executes calls | ToolCorrectnessMetric, ArgumentCorrectnessMetric |
| Execution Layer | Orchestrates the full loop, completes objectives | TaskCompletionMetric, StepEfficiencyMetric |
Each metric targets a specific failure mode. Together, they provide comprehensive coverage of everything that can go wrong in an AI agent pipeline.
Reasoning Layer Metrics
The reasoning layer is where your agent analyzes tasks, formulates plans, and decides on strategies. Poor reasoning leads to cascading failures—even perfect tool execution can't save an agent with a flawed plan.
Plan Quality Metric
The PlanQualityMetric evaluates whether the plan your agent generates is logical, complete, and efficient for accomplishing the given task. It extracts the task and plan from your agent's trace and uses an LLM judge to assess plan quality.
from deepeval.tracing import observe
from deepeval.dataset import Golden, EvaluationDataset
from deepeval.metrics import PlanQualityMetric
@observe(type="tool")
def search_flights(origin, destination, date):
return [{"id": "FL123", "price": 450}, {"id": "FL456", "price": 380}]
@observe(type="agent")
def travel_agent(user_input):
# Agent reasons: "I need to search for flights first, then book the cheapest"
flights = search_flights("NYC", "Paris", "2025-03-15")
cheapest = min(flights, key=lambda x: x["price"])
return f"Found cheapest flight: {cheapest['id']} for ${cheapest['price']}"
# Initialize metric
plan_quality = PlanQualityMetric(threshold=0.7, model="gpt-4o")
# Evaluate agent with plan quality metric
dataset = EvaluationDataset(goldens=[Golden(input="Find me the cheapest flight to Paris")])
for golden in dataset.evals_iterator(metrics=[plan_quality]):
travel_agent(golden.input)
When to use it: Use PlanQualityMetric when your agent explicitly reasons about how to approach a task before taking action. This is common in agents that use chain-of-thought prompting or expose their planning process.
How it's calculated:
The metric extracts the task (user's goal) and plan (agent's strategy) from the trace, then uses an LLM to score how well the plan addresses the task requirements.
If no plan is detectable in the trace—meaning the agent doesn't explicitly reason about its approach—the metric passes with a score of 1 by default.
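One way to make the plan detectable is to emit it from an observed planning step so it appears in the trace. The sketch below reuses search_flights from the example above and assumes the judge can pick the plan up from span outputs set via update_current_span; treat it as an illustration rather than the only supported pattern.

```python
from deepeval.tracing import observe, update_current_span

@observe(type="llm")
def plan_trip(user_input):
    # Hypothetical planning step: surface the plan text in the span output
    plan = "1) Search flights NYC to Paris, 2) Compare prices, 3) Return the cheapest option"
    update_current_span(input=user_input, output=plan)
    return plan

@observe(type="agent")
def planning_travel_agent(user_input):
    plan_trip(user_input)  # explicit reasoning step recorded in the trace
    flights = search_flights("NYC", "Paris", "2025-03-15")
    cheapest = min(flights, key=lambda x: x["price"])
    return f"Found cheapest flight: {cheapest['id']} for ${cheapest['price']}"
```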
→ Full Plan Quality documentation
Plan Adherence Metric
The PlanAdherenceMetric evaluates whether your agent follows its own plan during execution. Creating a good plan is only half the battle—an agent that deviates from its strategy mid-execution undermines its own reasoning.
from deepeval.tracing import observe
from deepeval.dataset import Golden, EvaluationDataset
from deepeval.metrics import PlanAdherenceMetric
@observe(type="tool")
def search_flights(origin, destination, date):
return [{"id": "FL123", "price": 450}, {"id": "FL456", "price": 380}]
@observe(type="tool")
def book_flight(flight_id):
return {"confirmation": "CONF-789", "flight_id": flight_id}
@observe(type="agent")
def travel_agent(user_input):
# Plan: 1) Search flights, 2) Book the cheapest one
flights = search_flights("NYC", "Paris", "2025-03-15")
cheapest = min(flights, key=lambda x: x["price"])
booking = book_flight(cheapest["id"])
return f"Booked flight {cheapest['id']}. Confirmation: {booking['confirmation']}"
# Initialize metric
plan_adherence = PlanAdherenceMetric(threshold=0.7, model="gpt-4o")
# Evaluate whether agent followed its plan
dataset = EvaluationDataset(goldens=[Golden(input="Book the cheapest flight to Paris")])
for golden in dataset.evals_iterator(metrics=[plan_adherence]):
travel_agent(golden.input)
When to use it: Use PlanAdherenceMetric alongside PlanQualityMetric when evaluating agents with explicit planning phases. If your agent creates multi-step plans, this metric ensures it actually follows through.
How it's calculated:
The metric extracts the task, plan, and actual execution steps from the trace, then uses an LLM to evaluate how faithfully the agent adhered to its stated plan.
Combine PlanQualityMetric and PlanAdherenceMetric together—a high-quality plan that's ignored is as problematic as a poor plan that's followed perfectly.
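For example, both metrics can run against the same traces in a single pass (reusing plan_quality, plan_adherence, dataset, and travel_agent from the snippets above):

```python
# One evaluation run, two planning metrics scored per trace
for golden in dataset.evals_iterator(metrics=[plan_quality, plan_adherence]):
    travel_agent(golden.input)
```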
→ Full Plan Adherence documentation
Action Layer Metrics
The action layer is where your agent interacts with external systems through tool calls. This is often where things go wrong—even state-of-the-art LLMs struggle with tool selection, argument generation, and call ordering.
Tool Correctness Metric
The ToolCorrectnessMetric evaluates whether your agent selects the right tools and calls them correctly. It compares the tools your agent actually called against a list of expected tools.
from deepeval.tracing import observe, update_current_span
from deepeval.dataset import Golden, EvaluationDataset, get_current_golden
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall
# Initialize metric
tool_correctness = ToolCorrectnessMetric(threshold=0.7)
@observe(type="tool")
def get_weather(city):
return {"temp": "22°C", "condition": "sunny"}
# Attach metric to the LLM component where tool decisions are made
@observe(type="llm", metrics=[tool_correctness])
def call_llm(messages):
# LLM decides to call get_weather tool
result = get_weather("Paris")
# Update span with tool calling information for evaluation
update_current_span(
input=messages[-1]["content"],
output=f"The weather is {result['condition']}, {result['temp']}",
expected_tools=get_current_golden().expected_tools
)
return result
@observe(type="agent")
def weather_agent(user_input):
return call_llm([{"role": "user", "content": user_input}])
# Evaluate
dataset = EvaluationDataset(goldens=[Golden(input="What's the weather in Paris?", expected_tools=[ToolCall(name="get_weather")])])
for golden in dataset.evals_iterator():
weather_agent(golden.input)
When to use it: Use ToolCorrectnessMetric when you have deterministic expectations about which tools should be called for a given task. It's particularly valuable for testing tool selection logic and identifying unnecessary tool calls.
How it's calculated:
The metric supports configurable strictness:
- Tool name matching (default) — considers a call correct if the tool name matches
- Input parameter matching — also requires input arguments to match
- Output matching — additionally requires outputs to match
- Ordering consideration — optionally enforces call sequence
- Exact matching — requires tools_called and expected_tools to be identical
When available_tools is provided, the metric also uses an LLM to evaluate whether your tool selection was optimal given all available options. The final score is the minimum of the deterministic and LLM-based scores.
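As a rough configuration sketch: the keyword names below (should_consider_ordering, should_exact_match) are assumptions based on the options described above, so confirm them against the Tool Correctness documentation before relying on them.

```python
from deepeval.metrics import ToolCorrectnessMetric

# Parameter names are assumed from the strictness options listed above;
# verify against the full Tool Correctness documentation.
strict_tool_correctness = ToolCorrectnessMetric(
    threshold=0.7,
    should_consider_ordering=True,  # enforce call sequence
    should_exact_match=True,        # tools_called must equal expected_tools exactly
)
```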
→ Full Tool Correctness documentation
Argument Correctness Metric
The ArgumentCorrectnessMetric evaluates whether your agent generates correct arguments for each tool call. Selecting the right tool with wrong arguments is as problematic as selecting the wrong tool entirely.
from deepeval.tracing import observe, update_current_span
from deepeval.dataset import Golden, EvaluationDataset
from deepeval.metrics import ArgumentCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall
# Initialize metric
argument_correctness = ArgumentCorrectnessMetric(threshold=0.7, model="gpt-4o")
@observe(type="tool")
def search_flights(origin, destination, date):
return [{"id": "FL123", "price": 450}, {"id": "FL456", "price": 380}]
# Attach metric to the LLM component where arguments are generated
@observe(type="llm", metrics=[argument_correctness])
def call_llm(user_input):
# LLM generates arguments for tool call
origin, destination, date = "NYC", "London", "2025-03-15"
flights = search_flights(origin, destination, date)
# Update span with input and output for evaluation
update_current_span(
input=user_input,
output=f"Found {len(flights)} flights",
)
return flights
@observe(type="agent")
def flight_agent(user_input):
return call_llm(user_input)
# Evaluate - metric checks if arguments match what input requested
dataset = EvaluationDataset(goldens=[
Golden(input="Search for flights from NYC to London on March 15th")
])
for golden in dataset.evals_iterator():
flight_agent(golden.input)
When to use it: Use ArgumentCorrectnessMetric when correct argument values are critical for task success. This is especially important for agents that interact with APIs, databases, or external services where incorrect arguments cause failures.
How it's calculated:
Unlike ToolCorrectnessMetric, this metric is fully LLM-based and referenceless—it evaluates argument correctness based on the input context rather than comparing against expected values.
The ArgumentCorrectnessMetric uses an LLM to determine correctness, making it ideal for cases where exact argument values aren't predetermined but should be logically derived from the input.
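To see what the metric penalizes, here is a hypothetical failing variant of the component above (reusing argument_correctness and search_flights), where the generated date contradicts the user's request:

```python
# Hypothetical failing case: the generated date doesn't match the request,
# so ArgumentCorrectnessMetric should score this span lower.
@observe(type="llm", metrics=[argument_correctness])
def buggy_call_llm(user_input):
    flights = search_flights("NYC", "London", "2025-04-15")  # wrong month
    update_current_span(input=user_input, output=f"Found {len(flights)} flights")
    return flights
```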
→ Full Argument Correctness documentation
Execution Layer Metrics
The execution layer encompasses the full agent loop—reasoning, acting, observing, and iterating until task completion. These metrics assess the end-to-end quality of your agent's behavior.
Task Completion Metric
The TaskCompletionMetric evaluates whether your agent successfully accomplishes the intended task. This is the ultimate measure of agent success—did it do what the user asked?
from deepeval.tracing import observe
from deepeval.dataset import Golden, EvaluationDataset
from deepeval.metrics import TaskCompletionMetric
@observe(type="tool")
def search_flights(origin, destination, date):
return [{"id": "FL123", "price": 450}, {"id": "FL456", "price": 380}]
@observe(type="tool")
def book_flight(flight_id):
return {"confirmation": "CONF-789", "flight_id": flight_id}
@observe(type="agent")
def travel_agent(user_input):
flights = search_flights("NYC", "LA", "2025-03-15")
cheapest = min(flights, key=lambda x: x["price"])
booking = book_flight(cheapest["id"])
return f"Booked flight {cheapest['id']} for ${cheapest['price']}. Confirmation: {booking['confirmation']}"
# Initialize metric - task can be auto-inferred or explicitly provided
task_completion = TaskCompletionMetric(threshold=0.7, model="gpt-4o")
# Evaluate whether agent completed the task
dataset = EvaluationDataset(goldens=[
Golden(input="Book the cheapest flight from NYC to LA for tomorrow")
])
for golden in dataset.evals_iterator(metrics=[task_completion]):
travel_agent(golden.input)
When to use it: Use TaskCompletionMetric as a top-level success indicator for any agent. It answers the fundamental question: did the agent accomplish its goal?
How it's calculated:
The metric extracts the task (either user-provided or inferred from the trace) and the outcome, then uses an LLM to evaluate alignment. A score of 1 means complete task fulfillment; lower scores indicate partial or failed completion.
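If you prefer to pin the task down rather than let the metric infer it from the trace, the comment in the snippet above notes it can be provided explicitly. The sketch below assumes a task constructor argument; check the Task Completion documentation for the exact parameter name.

```python
# Assumes a `task` keyword exists, per the "explicitly provided" note above
task_completion_explicit = TaskCompletionMetric(
    threshold=0.7,
    model="gpt-4o",
    task="Book the cheapest flight from NYC to LA",
)
```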
→ Full Task Completion documentation
Step Efficiency Metric
The StepEfficiencyMetric evaluates whether your agent completes tasks without unnecessary steps. An agent might complete a task but waste tokens, time, and resources on redundant or circuitous actions.
from deepeval.tracing import observe
from deepeval.dataset import Golden, EvaluationDataset
from deepeval.metrics import StepEfficiencyMetric
@observe(type="tool")
def search_flights(origin, destination, date):
return [{"id": "FL123", "price": 450}, {"id": "FL456", "price": 380}]
@observe(type="tool")
def book_flight(flight_id):
return {"confirmation": "CONF-789"}
@observe(type="agent")
def inefficient_agent(user_input):
# Inefficient: searches twice unnecessarily
flights1 = search_flights("NYC", "LA", "2025-03-15")
flights2 = search_flights("NYC", "LA", "2025-03-15") # Redundant!
cheapest = min(flights1, key=lambda x: x["price"])
booking = book_flight(cheapest["id"])
return f"Booked: {booking['confirmation']}"
# Initialize metric
step_efficiency = StepEfficiencyMetric(threshold=0.7, model="gpt-4o")
# Evaluate - metric will penalize the redundant search_flights call
dataset = EvaluationDataset(goldens=[
Golden(input="Book the cheapest flight from NYC to LA")
])
for golden in dataset.evals_iterator(metrics=[step_efficiency]):
inefficient_agent(golden.input)
When to use it: Use StepEfficiencyMetric alongside TaskCompletionMetric to ensure your agent isn't just successful but also efficient. This is critical for production agents where token costs and latency matter.
How it's calculated:
The metric extracts the task and all execution steps from the trace, then uses an LLM to evaluate efficiency. It penalizes redundant tool calls, unnecessary reasoning loops, and any actions not strictly required to complete the task.
A high TaskCompletionMetric score with a low StepEfficiencyMetric score indicates your agent works but needs optimization. Focus on reducing unnecessary steps without sacrificing success rate.
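For contrast, an efficient version of the same agent (reusing the tools above) performs each step exactly once:

```python
@observe(type="agent")
def efficient_agent(user_input):
    # Single search, single booking: no redundant steps to penalize
    flights = search_flights("NYC", "LA", "2025-03-15")
    cheapest = min(flights, key=lambda x: x["price"])
    booking = book_flight(cheapest["id"])
    return f"Booked: {booking['confirmation']}"
```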
→ Full Step Efficiency documentation
Putting It All Together
Here's a complete example showing how to use AI agent evaluation metrics across all three layers:
from deepeval.tracing import observe, update_current_span
from deepeval.dataset import Golden, EvaluationDataset, get_current_golden
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import (
TaskCompletionMetric,
StepEfficiencyMetric,
PlanQualityMetric,
PlanAdherenceMetric,
ToolCorrectnessMetric,
ArgumentCorrectnessMetric
)
# End-to-end metrics (analyze full agent trace)
task_completion = TaskCompletionMetric()
step_efficiency = StepEfficiencyMetric()
plan_quality = PlanQualityMetric()
plan_adherence = PlanAdherenceMetric()
# Component-level metrics (analyze specific components)
tool_correctness = ToolCorrectnessMetric()
argument_correctness = ArgumentCorrectnessMetric()
# Define tools
@observe(type="tool")
def search_flights(origin, destination, date):
return [{"id": "FL123", "price": 450}, {"id": "FL456", "price": 380}]
@observe(type="tool")
def book_flight(flight_id):
return {"confirmation": "CONF-789", "flight_id": flight_id}
# Attach component-level metrics to the LLM component
@observe(type="llm", metrics=[tool_correctness, argument_correctness])
def call_llm(user_input):
# LLM decides to search flights then book
origin, destination, date = "NYC", "Paris", "2025-03-18"
flights = search_flights(origin, destination, date)
cheapest = min(flights, key=lambda x: x["price"])
booking = book_flight(cheapest["id"])
# Update span with tool info for component-level evaluation
update_current_span(
input=user_input,
output=f"Booked {cheapest['id']}",
expected_tools=get_current_golden().expected_tools
)
return booking
@observe(type="agent")
def travel_agent(user_input):
booking = call_llm(user_input)
return f"Flight booked! Confirmation: {booking['confirmation']}"
# Create evaluation dataset
dataset = EvaluationDataset(goldens=[
Golden(input="Book a flight from NYC to Paris for next Tuesday", expected_tools=[ToolCall(name="search_flights"), ToolCall(name="book_flight")])
])
# Run evaluation with end-to-end metrics
for golden in dataset.evals_iterator(
metrics=[task_completion, step_efficiency, plan_quality, plan_adherence]
):
travel_agent(golden.input)
Choosing the Right AI Agent Evaluation Metrics
Not every agent needs every metric. Here's a decision framework:
| If Your Agent... | Prioritize These Metrics |
|---|---|
| Uses explicit planning/reasoning | PlanQualityMetric, PlanAdherenceMetric |
| Calls multiple tools | ToolCorrectnessMetric, ArgumentCorrectnessMetric |
| Has complex multi-step workflows | StepEfficiencyMetric, TaskCompletionMetric |
| Runs in production (cost-sensitive) | StepEfficiencyMetric |
| Is task-critical (must succeed) | TaskCompletionMetric |
All AI agent evaluation metrics in deepeval support custom LLM judges, configurable thresholds, strict mode for binary scoring, and detailed reasoning explanations. See each metric's documentation for full configuration options.
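These options generally map to constructor arguments shared across the metrics on this page. A sketch, assuming deepeval's common metric keywords (threshold, model, strict_mode, include_reason):

```python
from deepeval.metrics import TaskCompletionMetric

# Keyword names assumed to follow deepeval's shared metric interface;
# see each metric's documentation for its full set of options.
task_completion = TaskCompletionMetric(
    threshold=0.8,        # configurable passing threshold
    model="gpt-4o",       # custom LLM judge
    strict_mode=True,     # binary 0/1 scoring
    include_reason=True,  # detailed reasoning explanation
)
```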
Next Steps
Now that you understand the available AI agent evaluation metrics, here's where to go next:
- Set up tracing — Required for all agent metrics to capture execution traces
- AI Agent Evaluation Guide — Deep dive into evaluation strategies for development and production
- End-to-end Evals — Learn how to run metrics on full agent traces
- Component-level Evals — Learn how to attach metrics to specific components