AI Agent Evaluation
Learn how to evaluate AI agents using deepeval, including multi-agent systems and tool-using agents.
Overview
AI agent evaluation is different from other types of evals because agentic workflows are complex and consist of multiple interacting components, such as tools, chained LLM calls, and RAG modules. Therefore, it’s important to evaluate your AI agents both end-to-end and at the component level to understand how each part performs.
In this 5 min quickstart, you'll learn how to:
- Set up LLM tracing for your agent
- Evaluate your agent end-to-end
- Evaluate individual components in your agent
Prerequisites
- Install deepeval
- A Confident AI API key (recommended). Sign up for one here.
Confident AI allows you to view and share your evaluation traces. Set your API key in the CLI:
export CONFIDENT_API_KEY="confident_us..."
Setup LLM Tracing
In LLM tracing, a trace represents an end-to-end system interaction, whereas spans represent individual components in your agent. One or more spans make up a trace.
Choose your implementation
- Python
- LangGraph
- LangChain
- CrewAI
Attach the @observe decorator to functions/methods that make up your agent. These will represent individual components in your agent.
from deepeval.tracing import observe

@observe()
def your_ai_agent_tool():
    return 'tool call result'

@observe()
def your_ai_agent(input):
    tool_call_result = your_ai_agent_tool()
    return 'Tool Call Result: ' + tool_call_result
Pass in deepeval's CallbackHandler for LangGraph to your agent's invoke method.
from langgraph.prebuilt import create_react_agent
from deepeval.integrations.langchain import CallbackHandler

def get_weather(city: str) -> str:
    """Returns the weather in a city"""
    return f"It's always sunny in {city}!"

agent = create_react_agent(
    model="openai:gpt-4.1",
    tools=[get_weather],
    prompt="You are a helpful assistant"
)

result = agent.invoke(
    input={"messages": [{"role": "user", "content": "what is the weather in sf"}]},
    config={"callbacks": [CallbackHandler()]}
)
Pass in deepeval's CallbackHandler for LangChain to your agent's invoke method.
from langchain.chat_models import init_chat_model
from deepeval.integrations.langchain import CallbackHandler

def multiply(a: int, b: int) -> int:
    return a * b

llm = init_chat_model("gpt-4.1", model_provider="openai")
llm_with_tools = llm.bind_tools([multiply])

llm_with_tools.invoke(
    "What is 3 * 12?",
    config={"callbacks": [CallbackHandler()]}
)
Instrument deepeval as a span handler for CrewAI and import your Agent from deepeval instead.
from crewai import Task, Crew
from deepeval.integrations.crewai import instrument_crewai, Agent

instrument_crewai()

coder = Agent(
    role='Consultant',
    goal='Write clear, concise explanation.',
    backstory='An expert consultant with a keen eye for software trends.',
)

task = Task(
    description="Explain the latest trends in AI.",
    agent=coder,
    expected_output="A clear and concise explanation.",
)

crew = Crew(agents=[coder], tasks=[task])
result = crew.kickoff()
Configure environment variables
Setting the variable below prevents traces from being lost if your program terminates early.
export CONFIDENT_TRACE_FLUSH=YES
Invoke your agent
Run your agent as you would normally do:
python main.py
✅ Done. You should see a trace log like the one below in your CLI if you're logged in to Confident AI:
[Confident AI Trace Log] Successfully posted trace (...): https://app.confident.ai/[...]
Evaluate Your Agent End-to-End
An end-to-end evaluation means your agent is treated as a black box, where all that matters is the degree of task completion for a particular trace.
Configure evaluation model
To configure OpenAI as your evaluation model for all metrics, set your OPENAI_API_KEY in the CLI:
export OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>
You can also use these models for evaluation: Ollama, Azure OpenAI, Anthropic, Gemini, etc. To use ANY custom LLM of your choice, check out this part of the docs.
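If you'd rather plug in your own evaluation model, a rough sketch is shown below. It assumes deepeval's documented custom-model pattern of subclassing DeepEvalBaseLLM and implementing load_model, generate, a_generate, and get_model_name; the class name and the wrapped OpenAI client are illustrative only, so confirm the exact contract against the custom models docs before relying on it.

```python
from openai import OpenAI
from deepeval.models import DeepEvalBaseLLM

# A minimal sketch of a custom evaluation model, assuming the DeepEvalBaseLLM
# interface (load_model / generate / a_generate / get_model_name).
class MyEvalModel(DeepEvalBaseLLM):
    def __init__(self):
        self.client = OpenAI()

    def load_model(self):
        return self.client

    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4.1", messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return "My custom evaluation model"
```

You would then pass an instance of this class to a metric's model argument (again, treat this as an assumption to verify against the docs).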
Setup task completion metric
Task Completion is the most powerful metric on deepeval for evaluating AI agents end-to-end.
from deepeval.metrics import TaskCompletionMetric
task_completion_metric = TaskCompletionMetric()
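If you need more control over scoring, deepeval metrics generally accept optional constructor arguments. The sketch below assumes the common threshold and model parameters also apply to TaskCompletionMetric; check the metric's reference page to confirm.

```python
from deepeval.metrics import TaskCompletionMetric

# A minimal sketch, assuming the common deepeval metric options apply here:
# `threshold` is the minimum passing score and `model` picks the evaluation LLM.
task_completion_metric = TaskCompletionMetric(
    threshold=0.7,
    model="gpt-4.1",
)
```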
What other metrics are available?
Other metrics on deepeval can also be used to evaluate agents, but ONLY if you run component-level evaluations, since they require you to set up an LLM test case. For more information on available metrics, see the Metrics Introduction section.
The task completion metric works by analyzing traces to determine the task at hand and the degree of completion of said task.
Run an evaluation
Use the dataset iterator to invoke your agent with a list of goldens. You will need to:
- Create a dataset of goldens
- Loop through your dataset, calling your agent in each iteration with the task completion metric set
This will benchmark your agent at this point in time and create a test run.
- Python
- LangGraph
- LangChain
- CrewAI
Supply the task completion metric to the metrics argument inside the @observe decorator.
from deepeval.tracing import observe
from deepeval.dataset import EvaluationDataset, Golden
...

@observe()
def your_ai_agent_tool():
    return 'tool call result'

# Supply task completion
@observe(metrics=[task_completion_metric])
def your_ai_agent(input):
    tool_call_result = your_ai_agent_tool()
    return 'Tool Call Result: ' + tool_call_result

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input="This is a test query")])

# Loop through dataset
for golden in dataset.evals_iterator():
    your_ai_agent(golden.input)
Supply the task completion metric to the metrics argument inside the CallbackHandler.
from deepeval.integrations.langchain import CallbackHandler
from langgraph.prebuilt import create_react_agent
from deepeval.dataset import EvaluationDataset, Golden
...

def get_weather(city: str) -> str:
    """Returns the weather in a city"""
    return f"It's always sunny in {city}!"

agent = create_react_agent(
    model="openai:gpt-4.1",
    tools=[get_weather],
    prompt="You are a helpful assistant",
)

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input="Explain the latest trends in AI.")])

# Loop through dataset
for golden in dataset.evals_iterator():
    agent.invoke(
        input={"messages": [{"role": "user", "content": golden.input}]},
        # Supply task completion
        config={"callbacks": [CallbackHandler(metrics=[task_completion_metric])]},
    )
Supply the task completion metric to the metrics argument inside the CallbackHandler.
from langchain.chat_models import init_chat_model
from deepeval.integrations.langchain import CallbackHandler
from deepeval.dataset import EvaluationDataset, Golden
...

def multiply(a: int, b: int) -> int:
    return a * b

llm = init_chat_model("gpt-4.1", model_provider="openai")
llm_with_tools = llm.bind_tools([multiply])

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input="What is 3 * 12?")])

# Loop through dataset
for golden in dataset.evals_iterator():
    llm_with_tools.invoke(
        golden.input,
        # Supply task completion
        config={"callbacks": [CallbackHandler(metrics=[task_completion_metric])]},
    )
Supply the task completion metric to the metrics argument inside the Agent from deepeval.
from crewai import Task, Crew
from deepeval.integrations.crewai import Agent
from deepeval.dataset import EvaluationDataset, Golden
...

coder = Agent(
    role='Consultant',
    goal='Write clear, concise explanation.',
    backstory='An expert consultant with a keen eye for software trends.',
    metrics=[task_completion_metric]  # Supply task completion
)

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input="Explain the latest trends in AI.")])

# Loop through dataset
for golden in dataset.evals_iterator():
    task = Task(description=golden.input, agent=coder, expected_output="A clear and concise explanation.")
    crew = Crew(agents=[coder], tasks=[task])
    result = crew.kickoff()
Finally, run main.py:
python main.py
🎉🥳 Congratulations! You've just run your first agentic evals. Here's what happened:
- When you call dataset.evals_iterator(), deepeval starts a "test run"
- As you loop through your dataset, deepeval collects your agent's LLM traces and runs task completion on them
- Each task completion metric is run once per loop, creating a test case
In the end, you will have the same number of test cases in your test run as goldens in the dataset you ran evals with.
View on Confident AI (recommended)
If you've set your CONFIDENT_API_KEY, test runs will appear automatically on Confident AI, the DeepEval platform.
If you haven't logged in, you can still upload the test run to Confident AI from local cache:
deepeval view
Evaluate Agentic Components
Component-level evaluations treat your agent as a white box, allowing you to isolate and evaluate the performance of individual spans in your agent.
Component-level evaluation integrations are not available on deepeval at this time. If you're building with an LLM framework, manually trace your agent with @observe decorators to evaluate agentic components.
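For example, if your agent is built on a framework, you can wrap each framework call in a thin function of your own and decorate that. The sketch below assumes a LangChain chat model; the generate_answer wrapper and the choice of AnswerRelevancyMetric are illustrative, and it reuses the @observe and update_span APIs shown later in this section.

```python
from langchain.chat_models import init_chat_model
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_span

llm = init_chat_model("gpt-4.1", model_provider="openai")
answer_relevancy_metric = AnswerRelevancyMetric()

# Hypothetical wrapper: decorating a thin function of your own turns the
# framework call into a span that deepeval can evaluate at the component level.
@observe(metrics=[answer_relevancy_metric])
def generate_answer(query: str) -> str:
    response = llm.invoke(query)
    # Build the test case the metric needs (input + actual output).
    update_span(
        test_case=LLMTestCase(input=query, actual_output=response.content)
    )
    return response.content
```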
Define metrics
Any single-turn metric can be used to evaluate agentic components.
from deepeval.metrics import TaskCompletionMetric, ArgCorrectnessMetric

arg_correctness_metric = ArgCorrectnessMetric()
task_completion_metric = TaskCompletionMetric()
Setup test cases & metrics
Supply the metrics to the @observe decorator of each function, then define a test case in update_span if needed. The test case should include every parameter required by the metrics you select.
import json

from openai import OpenAI
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.tracing import observe, update_span
...

client = OpenAI()
tools = [...]

@observe()
def web_search_tool(web_query):
    return "Web search results"

# Supply metric
@observe(metrics=[arg_correctness_metric])
def llm(query):
    response = client.responses.create(model="gpt-4.1", input=[{"role": "user", "content": query}], tools=tools)
    # Format tools
    tools_called = [ToolCall(name=tool_call.name, arguments=tool_call.arguments) for tool_call in response.output if tool_call.type == "function_call"]
    # Create test cases on the component level
    update_span(
        test_case=LLMTestCase(input=query, actual_output=str(response.output), tools_called=tools_called)
    )
    return response.output

# Supply metric
@observe(metrics=[task_completion_metric])
def your_ai_agent(query: str) -> str:
    llm_output = llm(query)
    search_results = "".join([web_search_tool(**json.loads(tool_call.arguments)) for tool_call in llm_output if tool_call.type == "function_call"])
    return "The answer to your question is: " + search_results
Click to see a detailed explanation of the code example above
your_ai_agent is an AI agent that can answer any user query by searching the web for information. It does so by invoking llm, which calls the LLM using OpenAI's Responses API. The LLM can decide to either produce a direct response to the user query or call web_search_tool to perform a web search.
Although tools=[...] is condensed in the example above, it must be defined in the following format before being passed to OpenAI's client.responses.create method.
tools = [{
    "type": "function",
    "name": "web_search_tool",
    "description": "Search the web for information.",
    "parameters": {
        "type": "object",
        "properties": {
            "web_query": {"type": "string"}
        },
        "required": ["web_query"],
        "additionalProperties": False
    },
    "strict": True
}]
In the example above, Task Completion is used to evaluate the performance of the your_ai_agent function, while Argument Correctness is used to evaluate llm.
This is because Argument Correctness requires setting up a test case with the input, actual output, and tools called, whereas Task Completion is the only metric on DeepEval that doesn't require a test case.
Run an evaluation
Similar to end-to-end evals, use the dataset iterator to invoke your agent with a list of goldens. You will need to:
- Create a dataset of goldens
- Loop through your dataset, calling your agent in each iteration with your component-level metrics set
This will benchmark your agent at this point in time and create a test run.
from deepeval.dataset import EvaluationDataset, Golden
...

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input='What is component-level evals?')])

# Loop through dataset
for golden in dataset.evals_iterator():
    your_ai_agent(golden.input)
Finally, run main.py:
python main.py
✅ Done. Similar to end-to-end evals, the evals_iterator() creates a test run out of your dataset, with the only difference being that deepeval will evaluate and create test cases out of the individual components you've defined in your agent instead.
Next Steps
Now that you have run your first agentic evals, you should:
- Customize your metrics: Update the list of metrics for each component.
- Customize tracing: This helps benchmark and identify different components on the UI.
- Enable evals in production: Just replace metrics in @observe with a metric_collection string on Confident AI (see the sketch at the end of this section).
You'll be able to analyze performance over time on traces (end-to-end) and spans (component-level).
- End-to-end (traces) in prod
- Component-level (spans) in prod
Evals on traces are end-to-end evaluations, where a single LLM interaction is being evaluated.
Spans make up a trace, and evals on spans represent component-level evaluations, where individual components in your LLM app are being evaluated.
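As a rough sketch of the production setup described in the next steps above, the metric collection name below is a hypothetical one you would create on Confident AI; the rest mirrors the tracing code from earlier in this guide.

```python
from deepeval.tracing import observe

@observe()
def your_ai_agent_tool():
    return 'tool call result'

# A minimal sketch: "My Agent Metrics" is a hypothetical metric collection
# name configured on Confident AI; replace it with your own.
@observe(metric_collection="My Agent Metrics")
def your_ai_agent(input):
    tool_call_result = your_ai_agent_tool()
    return 'Tool Call Result: ' + tool_call_result
```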