
AI Agent Evaluation Quickstart

Learn how to evaluate AI Agents using deepeval, including multi-agent systems and tool-using agents.

Overview

AI agent evaluation is different from other types of evals because agentic workflows are complex and consist of multiple interacting components, such as tools, chained LLM calls, and RAG modules. Therefore, it’s important to evaluate your AI agents both end-to-end and at the component level to understand how each part performs.

In this 5 min quickstart, you'll learn how to:

  • Set up LLM tracing for your agent
  • Evaluate your agent end-to-end
  • Evaluate individual components in your agent

Prerequisites

  • Install deepeval
  • A Confident AI API key (recommended). Sign up for one here.
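If you haven't installed deepeval yet, you can do so with pip:

pip install -U deepeval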

Setup LLM Tracing

In LLM tracing, a trace represents an end-to-end system interaction, whereas spans represent individual components in your agent. One or more spans make up a trace.

Choose your implementation

Attach the @observe decorator to functions/methods that make up your agent. These will represent individual components in your agent.

from deepeval.tracing import observe

@observe()
def your_ai_agent_tool():
    return 'tool call result'

@observe()
def your_ai_agent(input):
    tool_call_result = your_ai_agent_tool()
    return 'Tool Call Result: ' + tool_call_result

your_ai_agent("Greetings, AI Agent.")

Pass in deepeval's CallbackHandler for LangGraph to your agent's invoke method.

from langgraph.prebuilt import create_react_agent
from deepeval.integrations.langchain import CallbackHandler

def get_weather(city: str) -> str:
    """Returns the weather in a city"""
    return f"It's always sunny in {city}!"

agent = create_react_agent(
    model="openai:gpt-4.1",
    tools=[get_weather],
    prompt="You are a helpful assistant",
)

agent.invoke(
    input={"messages": [{"role": "user", "content": "what is the weather in sf"}]},
    config={"callbacks": [CallbackHandler()]},
)

Pass in deepeval's CallbackHandler for LangChain to your runnable's invoke method.

from langchain.chat_models import init_chat_model
from deepeval.integrations.langchain import CallbackHandler

def multiply(a: int, b: int) -> int:
    return a * b

llm = init_chat_model("gpt-4.1", model_provider="openai")
llm_with_tools = llm.bind_tools([multiply])

llm_with_tools.invoke(
    "What is 3 * 12?",
    config={"callbacks": [CallbackHandler()]},
)

Call instrument_crewai() once, then build your crew with deepeval's Crew and Agent shims.

from crewai import Task
from deepeval.integrations.crewai import instrument_crewai, Crew, Agent

instrument_crewai()

coder = Agent(
    role="Consultant",
    goal="Write a clear, concise explanation.",
    backstory="An expert consultant with a keen eye for software trends.",
)

task = Task(
    description="Explain the latest trends in AI.",
    agent=coder,
    expected_output="A clear and concise explanation.",
)

crew = Crew(agents=[coder], tasks=[task])
crew.kickoff()

Register deepeval's event handler against LlamaIndex's instrumentation dispatcher.

import asyncio
from llama_index.llms.openai import OpenAI
from llama_index.core.agent import FunctionAgent
import llama_index.core.instrumentation as instrument

from deepeval.integrations.llama_index import instrument_llama_index

instrument_llama_index(instrument.get_dispatcher())

def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

agent = FunctionAgent(
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful calculator.",
)

asyncio.run(agent.run("What is 8 multiplied by 6?"))

Pass DeepEvalInstrumentationSettings() to your Agent's instrument keyword.

from pydantic_ai import Agent
from deepeval.integrations.pydantic_ai import DeepEvalInstrumentationSettings

agent = Agent(
    "openai:gpt-4.1",
    system_prompt="Be concise.",
    instrument=DeepEvalInstrumentationSettings(),
)

agent.run_sync("Greetings, AI Agent.")

Register DeepEvalTracingProcessor once, then build your agent with deepeval's Agent and function_tool shims.

from agents import Runner, add_trace_processor
from deepeval.openai_agents import Agent, DeepEvalTracingProcessor, function_tool

add_trace_processor(DeepEvalTracingProcessor())

@function_tool
def get_weather(city: str) -> str:
    """Returns the weather in a city."""
    return f"It's always sunny in {city}!"

agent = Agent(
    name="weather_agent",
    instructions="Answer weather questions concisely.",
    tools=[get_weather],
)

Runner.run_sync(agent, "What's the weather in Paris?")

Call instrument_google_adk() once before building your LlmAgent.

import asyncio
from google.adk.agents import LlmAgent
from google.adk.runners import InMemoryRunner
from google.genai import types

from deepeval.integrations.google_adk import instrument_google_adk

instrument_google_adk()

agent = LlmAgent(model="gemini-2.0-flash", name="assistant", instruction="Be concise.")
runner = InMemoryRunner(agent=agent, app_name="deepeval-quickstart")

async def run_agent(prompt: str) -> str:
    session = await runner.session_service.create_session(
        app_name="deepeval-quickstart", user_id="demo-user"
    )
    message = types.Content(role="user", parts=[types.Part(text=prompt)])
    async for event in runner.run_async(
        user_id="demo-user", session_id=session.id, new_message=message
    ):
        if event.is_final_response() and event.content:
            return "".join(p.text for p in event.content.parts if getattr(p, "text", None))
    return ""

asyncio.run(run_agent("What is 7 multiplied by 8?"))

Configure environment variables

Setting this environment variable prevents traces from being lost if your program terminates early:

export CONFIDENT_TRACE_FLUSH=1

Invoke your agent

Run your agent as you normally would:

python main.py

✅ Done. You should see a trace log like the one below in your CLI if you're logged in to Confident AI:

[Confident AI Trace Log]

Successfully posted trace (...):

https://app.confident.ai/[...]

Evaluate Your Agent End-to-End

An end-to-end evaluation means your agent is treated as a black box, where all that matters is the degree of task completion for a particular trace.

Configure evaluation model

To configure OpenAI as your evaluation model for all metrics, set your OPENAI_API_KEY in the CLI:

export OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>

You can also use these models for evaluation: Ollama, Azure OpenAI, Anthropic, Gemini, etc. To use ANY custom LLM of your choice, check out this part of the docs.

Setup task completion metric

Task Completion is the most powerful metric on deepeval for evaluating AI agents end-to-end.

from deepeval.metrics import TaskCompletionMetric

task_completion_metric = TaskCompletionMetric()
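
Like most deepeval metrics, TaskCompletionMetric also accepts optional arguments such as threshold, model, and include_reason if you want to tune how it's scored; the values below are purely illustrative:

from deepeval.metrics import TaskCompletionMetric

# Optional configuration (illustrative values)
task_completion_metric = TaskCompletionMetric(
    threshold=0.7,        # minimum score required to pass
    model="gpt-4.1",      # evaluation model to use
    include_reason=True,  # attach a reason to each score
)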
What other metrics are available?

Other metrics on deepeval can also be used to evaluate agents, but only if you run component-level evaluations, since they require you to set up an LLM test case.
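
For instance, here's a minimal sketch of two such single-turn metrics you could apply at the component level:

from deepeval.metrics import AnswerRelevancyMetric, ToolCorrectnessMetric

# These metrics need an LLMTestCase, so use them in component-level evals only
answer_relevancy_metric = AnswerRelevancyMetric()
tool_correctness_metric = ToolCorrectnessMetric()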

For more information on available metrics, see the Metrics Introduction section.

Run an evaluation

Use the dataset iterator to invoke your agent with a list of goldens. You will need to:

  1. Create a dataset of goldens
  2. Loop through your dataset, calling your agent in each iteration with the task completion metric set

This will benchmark your agent at this point in time and create a test run.

Supply the task completion metric to the metrics argument of @observe.

from deepeval.tracing import observe
from deepeval.dataset import EvaluationDataset, Golden
...

@observe()
def your_ai_agent_tool():
    return 'tool call result'

# Supply task completion
@observe(metrics=[task_completion_metric])
def your_ai_agent(input):
    tool_call_result = your_ai_agent_tool()
    return 'Tool Call Result: ' + tool_call_result

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input="This is a test query")])

# Loop through dataset
for golden in dataset.evals_iterator():
    your_ai_agent(golden.input)

Supply the task completion metric to the metrics argument of CallbackHandler.

from deepeval.integrations.langchain import CallbackHandler
from langgraph.prebuilt import create_react_agent
from deepeval.dataset import EvaluationDataset, Golden
...

def get_weather(city: str) -> str:
    """Returns the weather in a city"""
    return f"It's always sunny in {city}!"

agent = create_react_agent(
    model="openai:gpt-4.1",
    tools=[get_weather],
    prompt="You are a helpful assistant",
)

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input="What is the weather in Paris?")])

# Loop through dataset
for golden in dataset.evals_iterator():
    agent.invoke(
        input={"messages": [{"role": "user", "content": golden.input}]},
        # Supply task completion
        config={"callbacks": [CallbackHandler(metrics=[task_completion_metric])]},
    )

Supply the task completion metric to the metrics argument of CallbackHandler.

from langchain.chat_models import init_chat_model
from deepeval.integrations.langchain import CallbackHandler
from deepeval.dataset import EvaluationDataset, Golden
...

def multiply(a: int, b: int) -> int:
    return a * b

llm = init_chat_model("gpt-4.1", model_provider="openai")
llm_with_tools = llm.bind_tools([multiply])

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input="What is 3 * 12?")])

# Loop through dataset
for golden in dataset.evals_iterator():
    llm_with_tools.invoke(
        golden.input,
        # Supply task completion
        config={"callbacks": [CallbackHandler(metrics=[task_completion_metric])]},
    )

Supply the task completion metric to the metrics argument of deepeval's Agent shim.

from crewai import Task
from deepeval.integrations.crewai import instrument_crewai, Crew, Agent
from deepeval.dataset import EvaluationDataset, Golden
...

instrument_crewai()

coder = Agent(
    role="Consultant",
    goal="Write a clear, concise explanation.",
    backstory="An expert consultant with a keen eye for software trends.",
    # Supply task completion
    metrics=[task_completion_metric],
)
task = Task(
    description="Explain {topic}.",
    agent=coder,
    expected_output="A clear and concise explanation.",
)
crew = Crew(agents=[coder], tasks=[task])

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input="the latest trends in AI")])

# Loop through dataset
for golden in dataset.evals_iterator():
    crew.kickoff({"topic": golden.input})

Supply the task completion metric to an AgentSpanContext and pass it in via the trace(...) context manager.

import asyncio
from deepeval.tracing import trace, AgentSpanContext
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.evaluate.configs import AsyncConfig
...

# Reuse the agent and instrument_llama_index(...) from setup
async def run_agent(prompt: str):
    # Supply task completion
    with trace(agent_span_context=AgentSpanContext(metrics=[task_completion_metric])):
        return await agent.run(prompt)

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input="What is 8 multiplied by 6?")])

# Loop through dataset
for golden in dataset.evals_iterator(async_config=AsyncConfig(run_async=True)):
    task = asyncio.create_task(run_agent(golden.input))
    dataset.evaluate(task)

Supply the task completion metric to evals_iterator(metrics=[...]) to score the trace end-to-end.

from pydantic_ai import Agent
from deepeval.integrations.pydantic_ai import DeepEvalInstrumentationSettings
from deepeval.dataset import EvaluationDataset, Golden
...

agent = Agent(
    "openai:gpt-4.1",
    system_prompt="Be concise.",
    instrument=DeepEvalInstrumentationSettings(),
)

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input="What's the capital of France?")])

# Loop through dataset
for golden in dataset.evals_iterator(metrics=[task_completion_metric]):
    agent.run_sync(golden.input)

Supply the task completion metric to the agent_metrics argument of deepeval's Agent shim.

from agents import Runner, add_trace_processor
from deepeval.openai_agents import Agent, DeepEvalTracingProcessor, function_tool
from deepeval.dataset import EvaluationDataset, Golden
...

add_trace_processor(DeepEvalTracingProcessor())

@function_tool
def get_weather(city: str) -> str:
    """Returns the weather in a city."""
    return f"It's always sunny in {city}!"

agent = Agent(
    name="weather_agent",
    instructions="Answer weather questions concisely.",
    tools=[get_weather],
    # Supply task completion
    agent_metrics=[task_completion_metric],
)

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input="What's the weather in Paris?")])

# Loop through dataset
for golden in dataset.evals_iterator():
    Runner.run_sync(agent, golden.input)

Supply the task completion metric to evals_iterator(metrics=[...]) to score the trace end-to-end.

import asyncio
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.evaluate.configs import AsyncConfig
...

# Reuse the agent and run_agent(...) from setup
# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input="What is 7 multiplied by 8?")])

# Loop through dataset
for golden in dataset.evals_iterator(
    async_config=AsyncConfig(run_async=True),
    # Supply task completion
    metrics=[task_completion_metric],
):
    task = asyncio.create_task(run_agent(golden.input))
    dataset.evaluate(task)

Finally run main.py:

python main.py

🎉🥳 Congratulations! You've just run your first agentic evals. Here's what happened:

  • When you call dataset.evals_iterator(), deepeval starts a "test run"
  • As you loop through your dataset, deepeval collects your agent's LLM traces and runs task completion on them
  • Each task completion metric is run once per loop iteration, creating a test case

In the end, you will have the same number of test cases in your test run as goldens in the dataset you ran evals with.

If you've set your CONFIDENT_API_KEY, test runs will appear automatically on Confident AI, which deepeval integrates with natively. The flow is the same across every integration; the videos below show four representative frameworks.
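
For example, set the key in your shell before invoking your agent (placeholder shown, use your own key):

export CONFIDENT_API_KEY=<YOUR_CONFIDENT_API_KEY>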

Evaluate Agentic Components

Component-level evaluations treat your agent as a white box, allowing you to isolate and evaluate the performance of individual spans in your agent.

Define metrics

Any single-turn metric can be used to evaluate agentic components.

from deepeval.metrics import TaskCompletionMetric, ArgumentCorrectnessMetric

arg_correctness_metric = ArgumentCorrectnessMetric()
task_completion_metric = TaskCompletionMetric()

Setup test cases & metrics

Supply the metrics to the @observe decorator of each function, then define a test case via update_current_span where needed. The test case should include every parameter required by the metrics you select.

from openai import OpenAI
import json

from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.tracing import observe, update_current_span
...

client = OpenAI()
tools = [...]

@observe()
def web_search_tool(web_query):
    return "Web search results"

# Supply metric
@observe(metrics=[arg_correctness_metric])
def llm_component(query):
    response = client.responses.create(model="gpt-4.1", input=[{"role": "user", "content": query}], tools=tools)

    # Format tools
    tools_called = [ToolCall(name=tool_call.name, arguments=tool_call.arguments) for tool_call in response.output if tool_call.type == "function_call"]

    # Create test cases on the component-level
    update_current_span(
        test_case=LLMTestCase(input=query, actual_output=response.output_text, tools_called=tools_called)
    )
    return response.output

# Supply metric
@observe(metrics=[task_completion_metric])
def your_ai_agent(query: str) -> str:
    llm_output = llm_component(query)
    search_results = "".join(
        web_search_tool(**json.loads(tool_call.arguments))
        for tool_call in llm_output
        if tool_call.type == "function_call"
    )
    return "The answer to your question is: " + search_results

Here's a detailed explanation of the code example above:

your_ai_agent is an AI agent that can answer any user query by searching the web for information.

It does so by invoking llm_component, which calls the LLM using OpenAI's Responses API. The LLM can decide to either produce a direct response to the user query or call web_search_tool to perform a web search.

In the example above, Task Completion is used to evaluate the performance of the your_ai_agent function, while Argument Correctness is used to evaluate llm_component.

This is because Argument Correctness requires setting up a test case with the input, actual output, and tools called, whereas Task Completion is the only metric on deepeval that doesn't require a test case.

Run an evaluation

Similar to end-to-end evals, use the dataset iterator to invoke your agent with a list of goldens. You will need to:

  1. Create a dataset of goldens
  2. Loop through your dataset, calling your agent in each iteration with the task completion metric set

This will benchmark your agent at this point in time and create a test run.

from deepeval.dataset import EvaluationDataset, Golden
...

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input='What is component-level evals?')])

# Loop through dataset
for golden in dataset.evals_iterator():
    your_ai_agent(golden.input)

Finally run main.py:

python main.py

✅ Done. Similar to end-to-end evals, the evals_iterator() creates a test run out of your dataset; the only difference is that deepeval will evaluate and create test cases from the individual components you've defined in your agent instead.

Next Steps

Now that you have run your first agentic evals, you should:

  1. Customize your metrics: Update the list of metrics for each component.
  2. Customize tracing: This makes it easier to benchmark and identify different components on the UI (see the sketch below for one way to label spans).
  3. Explore the integration docs: Each framework integration has its own page with end-to-end and component-level patterns.
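
Here's a rough sketch of what customizing tracing might look like, assuming @observe accepts type and name keyword arguments (check the tracing docs for the exact options available):

from deepeval.tracing import observe

# Assumed keyword arguments: label spans so they're easy to identify on the UI
@observe(type="tool", name="Web Search")
def web_search_tool(web_query):
    return "Web search results"

@observe(type="agent", name="Research Agent")
def your_ai_agent(query: str) -> str:
    ...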

You'll be able to analyze performance over time on traces (end-to-end) and spans (component-level).

Evals on traces are end-to-end evaluations, where a single LLM interaction is being evaluated.

Trace-Level Evals in Production

Spans make up a trace, and evals on spans represent component-level evaluations, where individual components in your LLM app are being evaluated.

Span-Level Evals in Production
