
AI Agent Evaluation

Learn how to evaluate AI Agents using deepeval, including multi-agent systems and tool-using agents.

Overview

AI agent evaluation is different from other types of evals because agentic workflows are complex and consist of multiple interacting components, such as tools, chained LLM calls, and RAG modules. Therefore, it’s important to evaluate your AI agents both end-to-end and at the component level to understand how each part performs.

In this 5 min quickstart, you'll learn how to:

  • Set up LLM tracing for your agent
  • Evaluate your agent end-to-end
  • Evaluate individual components in your agent

Prerequisites

  • Install deepeval
  • A Confident AI API key (recommended). Sign up for one here.
info

Confident AI allows you to view and share your evaluation traces. Set your API key in the CLI:

CONFIDENT_API_KEY="confident_us..."

Setup LLM Tracing

In LLM tracing, a trace represents an end-to-end system interaction, whereas spans represent individual components in your agent. One or more spans make up a trace.

Choose your implementation

Attach the @observe decorator to functions/methods that make up your agent. These will represent individual components in your agent.

main.py
from deepeval.tracing import observe

@observe()
def your_ai_agent_tool():
    return 'tool call result'

@observe()
def your_ai_agent(input):
    tool_call_result = your_ai_agent_tool()
    return 'Tool Call Result: ' + tool_call_result
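
Note that main.py above only defines the agent; for the invocation step below (python main.py) to actually produce a trace, call the agent at the bottom of the script, for example:

if __name__ == "__main__":
    # Calling the top-level function produces one trace containing two spans:
    # one for your_ai_agent and a nested one for your_ai_agent_tool
    print(your_ai_agent("This is a test query"))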

Configure environment variables

This prevents traces from being lost if your program terminates early.

export CONFIDENT_TRACE_FLUSH=YES

Invoke your agent

Run your agent as you would normally do:

python main.py

✅ Done. You should see a trace log like the one below in your CLI if you're logged in to Confident AI:

[Confident AI Trace Log]  Successfully posted trace (...): https://app.confident.ai/[...]

Evaluate Your Agent End-to-End

An end-to-end evaluation treats your agent as a black box, where all that matters is the degree of task completion for a particular trace.

Configure evaluation model

To configure OpenAI as your evaluation model for all metrics, set your OPENAI_API_KEY in the CLI:

export OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>

You can also use these models for evaluation: Ollama, Azure OpenAI, Anthropic, Gemini, etc. To use ANY custom LLM of your choice, check out this part of the docs.

Setup task completion metric

Task Completion is the most powerful metric on deepeval for evaluating AI agents end-to-end.

from deepeval.metrics import TaskCompletionMetric

task_completion_metric = TaskCompletionMetric()
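
Most deepeval metrics also accept optional constructor arguments, such as a passing threshold, an evaluation model override, and whether to include a reason for the score. A minimal sketch with illustrative values (the defaults are fine for this quickstart):

task_completion_metric = TaskCompletionMetric(
    threshold=0.7,         # minimum score for a test case to pass
    model="gpt-4.1-mini",  # override the default evaluation model
    include_reason=True,   # attach the judge's reasoning to the result
)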
What other metrics are available?

Other metrics on deepeval can also be used to evaluate agents, but ONLY if you run component-level evaluations, since they require you to set up an LLM test case.

For more information on available metrics, see the Metrics Introduction section.

tip

The task completion metric works by analyzing traces to determine the task at hand and the degree of completion of said task.

Run an evaluation

Use the dataset iterator to invoke your agent with a list of goldens. You will need to:

  1. Create a dataset of goldens
  2. Loop through your dataset, calling your agent in each iteration with the task completion metric set

This will benchmark your agent at this point in time and create a test run.

Supply the task completion metric to the metrics argument inside the @observe decorator.

main.py
from deepeval.tracing import observe
from deepeval.dataset import EvaluationDataset, Golden
...

@observe()
def your_ai_agent_tool():
    return 'tool call result'

# Supply task completion
@observe(metrics=[task_completion_metric])
def your_ai_agent(input):
    tool_call_result = your_ai_agent_tool()
    return 'Tool Call Result: ' + tool_call_result

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input="This is a test query")])

# Loop through dataset
for golden in dataset.evals_iterator():
    your_ai_agent(golden.input)

Finally run main.py:

python main.py

🎉🥳 Congratulations! You've just run your first agentic evals. Here's what happened:

  • When you call dataset.evals_iterator(), deepeval starts a "test run"
  • As you loop through your dataset, deepeval collects your agent's LLM traces and runs task completion on them
  • Each task completion metric will be run once per loop, creating a test case

In the end, you will have the same number of test cases in your test run as goldens in the dataset you ran evals with.

View on Confident AI (recommended)

If you've set your CONFIDENT_API_KEY, test runs will appear automatically on Confident AI, the DeepEval platform.

tip

If you haven't logged in, you can still upload the test run to Confident AI from local cache:

deepeval view

Evaluate Agentic Components

Component-level evaluations treat your agent as a white box, allowing you to isolate and evaluate the performance of individual spans in your agent.

caution

Component-level evaluation integrations are not available on deepeval at this time. If you're building with an LLM framework, manually trace your agent with @observe decorators to evaluate agentic components.

Define metrics

Any single-turn metric can be used to evaluate agentic components.

main.py
from deepeval.metrics import TaskCompletionMetric, ArgCorrectnessMetric

arg_correctness_metric = ArgCorrectnessMetric()
task_completion_metric = TaskCompletionMetric()

Setup test cases & metrics

Supply the metrics to the @observe decorator of each function, then define a test case in update_span if needed. The test case should include every parameter required by the metrics you select.

main.py
import json

from openai import OpenAI
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.tracing import observe, update_span
...

client = OpenAI()
tools = [...]

@observe()
def web_search_tool(web_query):
    return "Web search results"

# Supply metric
@observe(metrics=[arg_correctness_metric])
def llm(query):
    response = client.responses.create(
        model="gpt-4.1",
        input=[{"role": "user", "content": query}],
        tools=tools,
    )

    # Format tool calls
    tools_called = [
        ToolCall(name=tool_call.name, arguments=tool_call.arguments)
        for tool_call in response.output
        if tool_call.type == "function_call"
    ]

    # Create test cases on the component-level
    update_span(
        test_case=LLMTestCase(
            input=query,
            actual_output=str(response.output),
            tools_called=tools_called,
        )
    )
    return response.output

# Supply metric
@observe(metrics=[task_completion_metric])
def your_ai_agent(query: str) -> str:
    llm_output = llm(query)
    search_results = "".join(
        web_search_tool(**json.loads(tool_call.arguments))
        for tool_call in llm_output
        if tool_call.type == "function_call"
    )
    return "The answer to your question is: " + search_results
Click to see a detailed explanation of the code example above

your_ai_agent is an AI agent that can answer any user query by searching the web for information.

It does so by invoking llm, which calls the LLM using OpenAI’s Responses API. The LLM can decide to either produce a direct response to the user query or call web_search_tool to perform a web search.

info

Although tools=[...] is condensed in the example above, it must be defined in the following format before being passed to OpenAI’s client.responses.create method.

tools = [{
    "type": "function",
    "name": "web_search_tool",
    "description": "Search the web for information.",
    "parameters": {
        "type": "object",
        "properties": {
            "web_query": {"type": "string"}
        },
        "required": ["web_query"],
        "additionalProperties": False
    },
    "strict": True
}]

In the example above, Task Completion is used to evaluate the performance of the your_ai_agent function, while Argument Correctness is used to evaluate llm.

This is because Argument Correctness requires setting up a test case with the input, actual output, and tools called, while Task Completion is the only metric on DeepEval that doesn't require a test case.
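
To make the test case requirement concrete, here is a minimal standalone sketch that scores Argument Correctness outside of tracing. The interaction is made up, and the ToolCall fields simply mirror the example above:

from deepeval.metrics import ArgCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

# Hypothetical interaction where the LLM decided to call web_search_tool
test_case = LLMTestCase(
    input="What is component-level evals?",
    actual_output="Calling web_search_tool to look this up.",
    tools_called=[ToolCall(name="web_search_tool", arguments='{"web_query": "component-level evals"}')],
)

metric = ArgCorrectnessMetric()
metric.measure(test_case)  # runs the metric with your configured evaluation model
print(metric.score, metric.reason)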

Run an evaluation

Similar to end-to-end evals, use the dataset iterator to invoke your agent with a list of goldens. You will need to:

  1. Create a dataset of goldens
  2. Loop through your dataset, calling your agent in each iteration with metrics set on its components

This will benchmark your agent at this point in time and create a test run.

main.py
from deepeval.dataset import EvaluationDataset, Golden
...

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input='What is component-level evals?')])

# Loop through dataset
for golden in dataset.evals_iterator():
    your_ai_agent(golden.input)

Finally run main.py:

python main.py

✅ Done. Similar to end-to-end evals, evals_iterator() creates a test run out of your dataset, with the only difference being that deepeval will evaluate and create test cases for the individual components you've defined in your agent instead.

Next Steps

Now that you have run your first agentic evals, you should:

  1. Customize your metrics: Update the list of metrics for each component.
  2. Customize tracing: This helps you benchmark and identify different components on the UI.
  3. Enable evals in production: Just replace metrics in @observe with a metric_collection string on Confident AI, as sketched below.
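
As a rough sketch of step 3, assuming you have created a metric collection on Confident AI (the collection name below is hypothetical):

from deepeval.tracing import observe

# Metrics are now configured on Confident AI and referenced by collection name
@observe(metric_collection="Task Completion")
def your_ai_agent(input):
    ...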

You'll be able to analyze performance over time on traces (end-to-end) and spans (component-level).

Evals on traces are end-to-end evaluations, where a single LLM interaction is being evaluated.

[Image: Trace-Level Evals in Production]