Single-Turn End-to-End Evaluation

A single-turn end-to-end test scores one input → one output per LLM interaction, captured as an LLMTestCase. This is the right flavor for any LLM application with a "flat" shape — agents treated as a black box, RAG / QA, summarization, classifiers, writing assistants, and so on.

If you haven't already, read the end-to-end overview for the concepts and how single-turn compares to multi-turn.

There are two ways to run a single-turn E2E test:

Approach | When to choose it
dataset.evals_iterator() with @observe tracing (recommended) | Your app is (or can be) instrumented with tracing. Test cases are built from traces automatically, and you get per-test-case traces on Confident AI for free.
evaluate(test_cases=...) | You can't (or don't want to) instrument your app, e.g. a QA engineer evaluating a deployed system. You build LLMTestCases up front and hand them to evaluate().

For projects you own, prefer evals_iterator() — same code, plus traces, plus a clean upgrade path to component-level evaluation.

evals_iterator() opens a test run, yields each golden, builds an LLMTestCase from the captured trace, scores your metrics against it, and uploads the trace + scores together — all in one loop.
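As a rough mental model of that loop, the sketch below uses a plain Python generator: it yields each golden, lets the loop body run the app, and scores whatever the app recorded once control returns to the generator. This is not deepeval's implementation. ToyGolden, ToyDataset, record_trace, toy_app, and exact_match are hypothetical names invented for illustration, and the upload step is omitted.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class ToyGolden:
    input: str
    expected_output: str


@dataclass
class ToyDataset:
    goldens: list
    results: list = field(default_factory=list)
    _trace: Optional[dict] = None

    def record_trace(self, input: str, output: str) -> None:
        # Stand-in for what tracing would capture while the app runs.
        self._trace = {"input": input, "output": output}

    def evals_iterator(self, metrics: list) -> Iterator[ToyGolden]:
        for golden in self.goldens:
            self._trace = None
            yield golden  # the body of the caller's for-loop runs here
            if self._trace is None:
                raise RuntimeError("loop body did not produce a trace")
            # Build a "test case" from the trace and score every metric on it.
            scores = {m.__name__: m(self._trace, golden) for m in metrics}
            self.results.append({**self._trace, "scores": scores})


def exact_match(trace: dict, golden: ToyGolden) -> float:
    return 1.0 if trace["output"] == golden.expected_output else 0.0


def toy_app(query: str) -> str:
    # Hypothetical app that only knows one answer.
    return {"2 + 2": "4"}.get(query, "unknown")


dataset = ToyDataset(goldens=[ToyGolden("2 + 2", "4"), ToyGolden("3 * 3", "9")])

for golden in dataset.evals_iterator(metrics=[exact_match]):
    dataset.record_trace(golden.input, toy_app(golden.input))

for row in dataset.results:
    print(row["input"], row["scores"])
```

The point of the generator shape is that scoring happens after each loop iteration finishes, which is why the loop body only needs to invoke the app.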

Build dataset

Datasets in deepeval store Goldens — precursors to test cases. You loop over goldens at evaluation time, run your traced LLM app on each, and deepeval builds an LLMTestCase from the resulting trace.

from deepeval.dataset import Golden, EvaluationDataset

goldens = [
    Golden(input="What is your name?"),
    Golden(input="Choose a number between 1 and 100"),
    # ...
]

dataset = EvaluationDataset(goldens=goldens)

The dataset lives only for this run — no push, no save. Perfect for quickstarts and one-off evaluations.

To pull an existing dataset from Confident AI instead:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My dataset")

To load goldens from a CSV file:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_goldens_from_csv_file(
    file_path="example.csv",
    input_col_name="query",
)

To load goldens from a JSON file:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_goldens_from_json_file(
    file_path="example.json",
    input_key_name="query",
)
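
For the file-based loaders, it can help to see what an input file with a query column or key might look like. The sketch below writes small example.csv and example.json files with the standard library and reads them back. The JSON layout (a list of objects, each with a query key) is an assumption on my part, and the files are written to a temporary directory rather than next to your code.

```python
import csv
import json
import tempfile
from pathlib import Path

queries = ["What is your name?", "Choose a number between 1 and 100"]

workdir = Path(tempfile.mkdtemp())
csv_path = workdir / "example.csv"
json_path = workdir / "example.json"

# CSV: one header row naming the column, then one row per golden input.
with csv_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["query"])
    writer.writeheader()
    writer.writerows({"query": q} for q in queries)

# JSON: assumed here to be a list of objects keyed by "query".
json_path.write_text(
    json.dumps([{"query": q} for q in queries], indent=2),
    encoding="utf-8",
)

# Read both files back to confirm they carry the same inputs.
with csv_path.open(newline="", encoding="utf-8") as f:
    csv_inputs = [row["query"] for row in csv.DictReader(f)]
json_inputs = [item["query"] for item in json.loads(json_path.read_text(encoding="utf-8"))]

print(csv_inputs)
print(json_inputs)
```

The column name passed as input_col_name (or the key passed as input_key_name) has to match whatever header or key your own file uses.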

Instrument/trace and evaluate

Instrument your AI agent based on your tech stack, then loop with evals_iterator(metrics=[...]) to score each captured trace as one end-to-end test case.

Each integration ships Async (default — fastest) and Sync variants:

  • Async keeps evals_iterator() on its default async dispatch and wraps each invocation in asyncio.create_task(...) + dataset.evaluate(task) so goldens run concurrently.
  • Sync passes AsyncConfig(run_async=False) and runs the loop body one golden at a time. Useful for debugging, rate-limited providers, or anywhere asyncio gets in the way (e.g. some Jupyter setups).
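
To see why the async variant tends to be faster, here is a standard-library sketch of the underlying asyncio pattern, independent of deepeval. fake_agent is a hypothetical stand-in for a network-bound LLM call, and the 0.1 second sleep is an arbitrary illustrative latency.

```python
import asyncio
import time


async def fake_agent(query: str) -> str:
    await asyncio.sleep(0.1)  # stands in for waiting on an LLM provider
    return f"answer to {query!r}"


async def run_concurrently(queries):
    # Schedule every call up front, then wait for all of them together.
    tasks = [asyncio.create_task(fake_agent(q)) for q in queries]
    return await asyncio.gather(*tasks)


async def run_sequentially(queries):
    # Finish each call before starting the next one.
    return [await fake_agent(q) for q in queries]


queries = ["a", "b", "c", "d"]

start = time.perf_counter()
concurrent_answers = asyncio.run(run_concurrently(queries))
concurrent_seconds = time.perf_counter() - start

start = time.perf_counter()
sequential_answers = asyncio.run(run_sequentially(queries))
sequential_seconds = time.perf_counter() - start

print(f"concurrent: {concurrent_seconds:.2f}s, sequential: {sequential_seconds:.2f}s")
```

Both runs return the same answers in the same order. Only the waiting overlaps, which is also why a rate-limited provider may be better served by the sync variant.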

Wrap the top-level function with @observe and call update_current_trace(...) to set the trace-level test case fields:

main.py
import asyncio
from deepeval.tracing import observe, update_current_trace
from deepeval.metrics import TaskCompletionMetric
...

@observe()
async def my_ai_agent(query: str) -> str:
    answer = "..."  # await your LLM call here
    update_current_trace(input=query, output=answer)
    return answer

for golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):
    task = asyncio.create_task(my_ai_agent(golden.input))
    dataset.evaluate(task)

Sync variant:

main.py
from deepeval.evaluate import AsyncConfig
from deepeval.tracing import observe, update_current_trace
from deepeval.metrics import TaskCompletionMetric
...

@observe()
def my_ai_agent(query: str) -> str:
    answer = "..."  # call your LLM here
    update_current_trace(input=query, output=answer)
    return answer

for golden in dataset.evals_iterator(
    metrics=[TaskCompletionMetric()],
    async_config=AsyncConfig(run_async=False),
):
    my_ai_agent(golden.input)

See tracing for the full @observe and update_current_trace surface.

Build your agent with create_agent, then pass deepeval's CallbackHandler to its invoke / ainvoke method inside the loop:

langchain_app.py
import asyncio
from langchain.agents import create_agent
from deepeval.integrations.langchain import CallbackHandler
from deepeval.metrics import TaskCompletionMetric
...

def multiply(a: int, b: int) -> int:
    """Multiply two numbers."""
    return a * b

agent = create_agent(
    model="openai:gpt-4o-mini",
    tools=[multiply],
    system_prompt="Be concise.",
)

async def run_agent(prompt: str):
    return await agent.ainvoke(
        {"messages": [{"role": "user", "content": prompt}]},
        config={"callbacks": [CallbackHandler()]},
    )

for golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):
    task = asyncio.create_task(run_agent(golden.input))
    dataset.evaluate(task)

Sync variant:

langchain_app.py
from langchain.agents import create_agent
from deepeval.evaluate import AsyncConfig
from deepeval.integrations.langchain import CallbackHandler
from deepeval.metrics import TaskCompletionMetric
...

def multiply(a: int, b: int) -> int:
    """Multiply two numbers."""
    return a * b

agent = create_agent(
    model="openai:gpt-4o-mini",
    tools=[multiply],
    system_prompt="Be concise.",
)

for golden in dataset.evals_iterator(
    metrics=[TaskCompletionMetric()],
    async_config=AsyncConfig(run_async=False),
):
    agent.invoke(
        {"messages": [{"role": "user", "content": golden.input}]},
        config={"callbacks": [CallbackHandler()]},
    )

See the LangChain integration for the full surface.

Wire your StateGraph, then pass deepeval's CallbackHandler to its invoke / ainvoke method inside the loop:

langgraph_app.py
import asyncio
from langchain.chat_models import init_chat_model
from langgraph.graph import StateGraph, MessagesState, START, END
from deepeval.integrations.langchain import CallbackHandler
from deepeval.metrics import TaskCompletionMetric
...

llm = init_chat_model("openai:gpt-4o-mini")

async def chatbot(state: MessagesState):
    return {"messages": [await llm.ainvoke(state["messages"])]}

graph = (
    StateGraph(MessagesState)
    .add_node(chatbot)
    .add_edge(START, "chatbot")
    .add_edge("chatbot", END)
    .compile()
)

async def run_graph(prompt: str):
    return await graph.ainvoke(
        {"messages": [{"role": "user", "content": prompt}]},
        config={"callbacks": [CallbackHandler()]},
    )

for golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):
    task = asyncio.create_task(run_graph(golden.input))
    dataset.evaluate(task)

Sync variant:

langgraph_app.py
from langchain.chat_models import init_chat_model
from langgraph.graph import StateGraph, MessagesState, START, END
from deepeval.evaluate import AsyncConfig
from deepeval.integrations.langchain import CallbackHandler
from deepeval.metrics import TaskCompletionMetric
...

llm = init_chat_model("openai:gpt-4o-mini")

def chatbot(state: MessagesState):
    return {"messages": [llm.invoke(state["messages"])]}

graph = (
    StateGraph(MessagesState)
    .add_node(chatbot)
    .add_edge(START, "chatbot")
    .add_edge("chatbot", END)
    .compile()
)

for golden in dataset.evals_iterator(
    metrics=[TaskCompletionMetric()],
    async_config=AsyncConfig(run_async=False),
):
    graph.invoke(
        {"messages": [{"role": "user", "content": golden.input}]},
        config={"callbacks": [CallbackHandler()]},
    )

See the LangGraph integration for the full surface.

Drop-in replace from openai import OpenAI with from deepeval.openai import OpenAI (or AsyncOpenAI). Wrap the call in with trace(): so the LLM call becomes a trace:

openai_app.py
import asyncio
from deepeval.openai import AsyncOpenAI
from deepeval.tracing import trace
from deepeval.metrics import TaskCompletionMetric
...

client = AsyncOpenAI()

async def call_openai(prompt: str):
    with trace():
        return await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )

for golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):
    task = asyncio.create_task(call_openai(golden.input))
    dataset.evaluate(task)

Sync variant:

openai_app.py
from deepeval.openai import OpenAI
from deepeval.tracing import trace
from deepeval.evaluate import AsyncConfig
from deepeval.metrics import TaskCompletionMetric
...

client = OpenAI()

for golden in dataset.evals_iterator(
    metrics=[TaskCompletionMetric()],
    async_config=AsyncConfig(run_async=False),
):
    with trace():
        client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": golden.input}],
        )

See the OpenAI integration for streaming and tool-calling.

Pass DeepEvalInstrumentationSettings() to your Agent's instrument keyword:

pydanticai_agent.py
import asyncio
from pydantic_ai import Agent
from deepeval.integrations.pydantic_ai import DeepEvalInstrumentationSettings
from deepeval.metrics import TaskCompletionMetric
...

agent = Agent(
    "openai:gpt-4.1",
    system_prompt="Be concise.",
    instrument=DeepEvalInstrumentationSettings(),
)

for golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):
    task = asyncio.create_task(agent.run(golden.input))
    dataset.evaluate(task)

Sync variant:

pydanticai_agent.py
from pydantic_ai import Agent
from deepeval.evaluate import AsyncConfig
from deepeval.integrations.pydantic_ai import DeepEvalInstrumentationSettings
from deepeval.metrics import TaskCompletionMetric
...

agent = Agent(
    "openai:gpt-4.1",
    system_prompt="Be concise.",
    instrument=DeepEvalInstrumentationSettings(),
)

for golden in dataset.evals_iterator(
    metrics=[TaskCompletionMetric()],
    async_config=AsyncConfig(run_async=False),
):
    agent.run_sync(golden.input)

See the Pydantic AI integration for the full surface.

Call instrument_agentcore() before creating your agent. The same call also instruments Strands agents running inside AgentCore:

agentcore_agent.py
import asyncio
from strands import Agent
from deepeval.integrations.agentcore import instrument_agentcore
from deepeval.metrics import TaskCompletionMetric
...

instrument_agentcore()

agent = Agent(model="amazon.nova-lite-v1:0")

for golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):
    task = asyncio.create_task(agent.invoke_async(golden.input))
    dataset.evaluate(task)

Sync variant:

agentcore_agent.py
from strands import Agent
from deepeval.evaluate import AsyncConfig
from deepeval.integrations.agentcore import instrument_agentcore
from deepeval.metrics import TaskCompletionMetric
...

instrument_agentcore()

agent = Agent(model="amazon.nova-lite-v1:0")

for golden in dataset.evals_iterator(
    metrics=[TaskCompletionMetric()],
    async_config=AsyncConfig(run_async=False),
):
    agent(golden.input)

See the AgentCore integration for the full surface (including the BedrockAgentCoreApp entrypoint pattern).

Call instrument_strands() before invoking your Strands agent (for AgentCore-hosted Strands, use the AgentCore tab instead):

strands_agent.py
import asyncio
from strands import Agent
from strands.models.openai import OpenAIModel
from deepeval.integrations.strands import instrument_strands
from deepeval.metrics import TaskCompletionMetric
...

instrument_strands()

agent = Agent(
    model=OpenAIModel(model_id="gpt-4o-mini"),
    system_prompt="You are a helpful assistant.",
)

for golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):
    task = asyncio.create_task(agent.invoke_async(golden.input))
    dataset.evaluate(task)

Sync variant:

strands_agent.py
from strands import Agent
from strands.models.openai import OpenAIModel
from deepeval.evaluate import AsyncConfig
from deepeval.integrations.strands import instrument_strands
from deepeval.metrics import TaskCompletionMetric
...

instrument_strands()

agent = Agent(
    model=OpenAIModel(model_id="gpt-4o-mini"),
    system_prompt="You are a helpful assistant.",
)

for golden in dataset.evals_iterator(
    metrics=[TaskCompletionMetric()],
    async_config=AsyncConfig(run_async=False),
):
    agent(golden.input)

See the Strands integration for the full surface.

Drop-in replace from anthropic import Anthropic with from deepeval.anthropic import Anthropic (or AsyncAnthropic). Wrap the call in with trace(): so the LLM call becomes a trace:

anthropic_app.py
import asyncio
from deepeval.anthropic import AsyncAnthropic
from deepeval.tracing import trace
from deepeval.metrics import TaskCompletionMetric
...

client = AsyncAnthropic()

async def call_claude(prompt: str):
    with trace():
        return await client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )

for golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):
    task = asyncio.create_task(call_claude(golden.input))
    dataset.evaluate(task)

Sync variant:

anthropic_app.py
from deepeval.anthropic import Anthropic
from deepeval.tracing import trace
from deepeval.evaluate import AsyncConfig
from deepeval.metrics import TaskCompletionMetric
...

client = Anthropic()

for golden in dataset.evals_iterator(
    metrics=[TaskCompletionMetric()],
    async_config=AsyncConfig(run_async=False),
):
    with trace():
        client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            messages=[{"role": "user", "content": golden.input}],
        )

See the Anthropic integration for streaming and tool-use.

Register deepeval's event handler against LlamaIndex's instrumentation dispatcher. agent.run(...) is async-only, so the sync variant uses asyncio.run(...):

llamaindex_agent.py
import asyncio
from llama_index.llms.openai import OpenAI
from llama_index.core.agent import FunctionAgent
import llama_index.core.instrumentation as instrument
from deepeval.integrations.llama_index import instrument_llama_index
from deepeval.metrics import TaskCompletionMetric
...

instrument_llama_index(instrument.get_dispatcher())

def multiply(a: float, b: float) -> float:
    return a * b

agent = FunctionAgent(
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful calculator.",
)

for golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):
    task = asyncio.create_task(agent.run(golden.input))
    dataset.evaluate(task)

Sync variant:

llamaindex_agent.py
import asyncio
from llama_index.llms.openai import OpenAI
from llama_index.core.agent import FunctionAgent
import llama_index.core.instrumentation as instrument
from deepeval.evaluate import AsyncConfig
from deepeval.integrations.llama_index import instrument_llama_index
from deepeval.metrics import TaskCompletionMetric
...

instrument_llama_index(instrument.get_dispatcher())

def multiply(a: float, b: float) -> float:
    return a * b

agent = FunctionAgent(
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful calculator.",
)

for golden in dataset.evals_iterator(
    metrics=[TaskCompletionMetric()],
    async_config=AsyncConfig(run_async=False),
):
    asyncio.run(agent.run(golden.input))

See the LlamaIndex integration for the full surface.

Register DeepEvalTracingProcessor once, then build your agent with deepeval's Agent and function_tool shims:

openai_agents_app.py
import asyncio
from agents import Runner, add_trace_processor
from deepeval.openai_agents import Agent, DeepEvalTracingProcessor, function_tool
from deepeval.metrics import TaskCompletionMetric
...

add_trace_processor(DeepEvalTracingProcessor())

@function_tool
def get_weather(city: str) -> str:
    return f"It's always sunny in {city}!"

agent = Agent(
    name="weather_agent",
    instructions="Answer weather questions concisely.",
    tools=[get_weather],
)

for golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):
    task = asyncio.create_task(Runner.run(agent, golden.input))
    dataset.evaluate(task)

Sync variant:

openai_agents_app.py
from agents import Runner, add_trace_processor
from deepeval.evaluate import AsyncConfig
from deepeval.openai_agents import Agent, DeepEvalTracingProcessor, function_tool
from deepeval.metrics import TaskCompletionMetric
...

add_trace_processor(DeepEvalTracingProcessor())

@function_tool
def get_weather(city: str) -> str:
    return f"It's always sunny in {city}!"

agent = Agent(
    name="weather_agent",
    instructions="Answer weather questions concisely.",
    tools=[get_weather],
)

for golden in dataset.evals_iterator(
    metrics=[TaskCompletionMetric()],
    async_config=AsyncConfig(run_async=False),
):
    Runner.run_sync(agent, golden.input)

See the OpenAI Agents integration for the full surface.

Call instrument_google_adk() once before building your LlmAgent. ADK's runner.run_async(...) is async-only, so the sync variant uses asyncio.run(...):

google_adk_agent.py
import asyncio
from google.adk.agents import LlmAgent
from google.adk.runners import InMemoryRunner
from google.genai import types
from deepeval.integrations.google_adk import instrument_google_adk
from deepeval.metrics import TaskCompletionMetric
...

instrument_google_adk()

agent = LlmAgent(model="gemini-2.0-flash", name="assistant", instruction="Be concise.")
runner = InMemoryRunner(agent=agent, app_name="deepeval-quickstart")

async def run_agent(prompt: str) -> str:
    session = await runner.session_service.create_session(
        app_name="deepeval-quickstart", user_id="demo-user",
    )
    message = types.Content(role="user", parts=[types.Part(text=prompt)])
    async for event in runner.run_async(
        user_id="demo-user", session_id=session.id, new_message=message,
    ):
        if event.is_final_response() and event.content:
            return "".join(part.text for part in event.content.parts if getattr(part, "text", None))
    return ""

for golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):
    task = asyncio.create_task(run_agent(golden.input))
    dataset.evaluate(task)

Sync variant:

google_adk_agent.py
import asyncio
from google.adk.agents import LlmAgent
from google.adk.runners import InMemoryRunner
from google.genai import types
from deepeval.evaluate import AsyncConfig
from deepeval.integrations.google_adk import instrument_google_adk
from deepeval.metrics import TaskCompletionMetric
...

instrument_google_adk()

agent = LlmAgent(model="gemini-2.0-flash", name="assistant", instruction="Be concise.")
runner = InMemoryRunner(agent=agent, app_name="deepeval-quickstart")

async def run_agent(prompt: str) -> str:
    session = await runner.session_service.create_session(
        app_name="deepeval-quickstart", user_id="demo-user",
    )
    message = types.Content(role="user", parts=[types.Part(text=prompt)])
    async for event in runner.run_async(
        user_id="demo-user", session_id=session.id, new_message=message,
    ):
        if event.is_final_response() and event.content:
            return "".join(part.text for part in event.content.parts if getattr(part, "text", None))
    return ""

for golden in dataset.evals_iterator(
    metrics=[TaskCompletionMetric()],
    async_config=AsyncConfig(run_async=False),
):
    asyncio.run(run_agent(golden.input))

See the Google ADK integration for the full surface.

Call instrument_crewai() once, then build your crew with deepeval's Crew, Agent, and @tool shims:

crewai_app.py
import asyncio
from crewai import Task
from deepeval.integrations.crewai import instrument_crewai, Crew, Agent
from deepeval.metrics import TaskCompletionMetric
...

instrument_crewai()

tutor = Agent(
    role="Math Tutor",
    goal="Answer math questions accurately and concisely.",
    backstory="An experienced tutor who explains simple math clearly.",
)
answer_task = Task(
    description="{question}",
    expected_output="An accurate, concise answer.",
    agent=tutor,
)
crew = Crew(agents=[tutor], tasks=[answer_task])

for golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):
    task = asyncio.create_task(crew.kickoff_async({"question": golden.input}))
    dataset.evaluate(task)

Sync variant:

crewai_app.py
from crewai import Task
from deepeval.evaluate import AsyncConfig
from deepeval.integrations.crewai import instrument_crewai, Crew, Agent
from deepeval.metrics import TaskCompletionMetric
...

instrument_crewai()

tutor = Agent(
    role="Math Tutor",
    goal="Answer math questions accurately and concisely.",
    backstory="An experienced tutor who explains simple math clearly.",
)
task = Task(
    description="{question}",
    expected_output="An accurate, concise answer.",
    agent=tutor,
)
crew = Crew(agents=[tutor], tasks=[task])

for golden in dataset.evals_iterator(
    metrics=[TaskCompletionMetric()],
    async_config=AsyncConfig(run_async=False),
):
    crew.kickoff({"question": golden.input})

See the CrewAI integration for the full surface.

There are SIX optional parameters on evals_iterator():

  • [Optional] metrics: a list of BaseMetrics applied at the trace level — these are the end-to-end metrics that score the whole trace.
  • [Optional] identifier: a string label for this test run on Confident AI.
  • [Optional] async_config: an AsyncConfig controlling concurrency. See async configs.
  • [Optional] display_config: a DisplayConfig controlling console output. See display configs.
  • [Optional] error_config: an ErrorConfig controlling error handling. See error configs.
  • [Optional] cache_config: a CacheConfig controlling caching. See cache configs.

Every evals_iterator() run is snapshotted to disk, so you can open it in a trace-tree TUI with bare deepeval inspect. See the deepeval inspect reference for full details.

To grade individual components (the retriever, a tool call, an inner LLM call) instead of (or in addition to) the trace, see component-level evaluation.

If you're logged in to Confident AI via deepeval login, you can also store, share, view, and annotate full traces in testing reports on the platform.

Approach 2: evaluate()

Use this when you can't (or don't want to) instrument your app — for example a QA engineer testing a deployed system, or a quick one-off eval where adding tracing is overkill. You build a list of LLMTestCases up front from inputs and outputs you've already collected, pick metrics, and call evaluate().

How it works:

  1. You build a list of LLMTestCases yourself by looping over goldens and calling your LLM app.
  2. You hand the test cases and metrics to evaluate() in a single call.
  3. deepeval runs every metric on every test case (concurrently by default) and rolls the results into a test run.
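
The three steps above can be pictured with a small standalone sketch. It is a toy model rather than deepeval's code: ToyTestCase, ToyMetric, toy_evaluate, and the two example metrics are hypothetical, and the real library runs metrics concurrently by default while this loop is sequential.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyTestCase:
    input: str
    actual_output: str


@dataclass
class ToyMetric:
    name: str
    score_fn: Callable[[ToyTestCase], float]
    threshold: float = 0.5


def toy_evaluate(test_cases, metrics):
    """Score every metric on every test case and collect one row per case."""
    run = []
    for case in test_cases:
        row = {"input": case.input}
        for metric in metrics:
            score = metric.score_fn(case)
            row[metric.name] = {"score": score, "passed": score >= metric.threshold}
        run.append(row)
    return run


non_empty = ToyMetric("non_empty", lambda c: 1.0 if c.actual_output.strip() else 0.0)
brevity = ToyMetric("brevity", lambda c: 1.0 if len(c.actual_output) <= 20 else 0.0)

# Step 1: build test cases from inputs and outputs collected beforehand.
collected = [("What is your name?", "DeepBot"), ("Pick a number", "")]
test_cases = [ToyTestCase(input=i, actual_output=o) for i, o in collected]

# Steps 2 and 3: hand everything over in one call and get a run back.
report = toy_evaluate(test_cases, [non_empty, brevity])
for row in report:
    print(row)
```

Note that toy_evaluate never calls the app itself. It only sees recorded inputs and outputs, which mirrors the decoupling this approach relies on.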

Your LLM app and deepeval stay completely decoupled — evaluate() only sees the data you pass to it. That's why this approach has no tracing dependency.

Build dataset

Same as Approach 1 — wrap your goldens in an EvaluationDataset. Pick whichever source fits where your goldens live today:

from deepeval.dataset import Golden, EvaluationDataset

goldens = [
    Golden(input="What is your name?"),
    Golden(input="Choose a number between 1 and 100"),
    # ...
]

dataset = EvaluationDataset(goldens=goldens)

To pull an existing dataset from Confident AI instead:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")

To load goldens from a CSV file:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_goldens_from_csv_file(
    file_path="example.csv",
    input_col_name="query",
)

To load goldens from a JSON file:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_goldens_from_json_file(
    file_path="example.json",
    input_key_name="query",
)

To persist a dataset (push to Confident AI, save as CSV/JSON, version across runs), see the datasets page.

Construct test cases

Loop over your goldens, call your LLM app, and wrap each result in an LLMTestCase:

main.py
from your_app import your_llm_app  # replace with your LLM app
from deepeval.test_case import LLMTestCase
...

for golden in dataset.goldens:
    answer, retrieved_chunks = your_llm_app(golden.input)
    dataset.add_test_case(
        LLMTestCase(
            input=golden.input,
            actual_output=answer,
            retrieval_context=retrieved_chunks,
        )
    )

Run evaluate()

Now pick the metrics you want to grade your application on, and pass both test_cases and metrics to evaluate().

main.py
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
...

evaluate(
    test_cases=dataset.test_cases,  # test cases added via dataset.add_test_case() above
    metrics=[AnswerRelevancyMetric(), FaithfulnessMetric()],
)

There are TWO mandatory and FIVE optional parameters when calling evaluate() for end-to-end evaluation:

  • test_cases: a list of LLMTestCases OR ConversationalTestCases, or an EvaluationDataset. You cannot mix LLMTestCases and ConversationalTestCases in the same test run.
  • metrics: a list of metrics of type BaseMetric.
  • [Optional] identifier: a string label for this test run on Confident AI.
  • [Optional] async_config: an AsyncConfig controlling concurrency. See async configs.
  • [Optional] display_config: a DisplayConfig controlling console output. See display configs.
  • [Optional] error_config: an ErrorConfig controlling how errors are handled. See error configs.
  • [Optional] cache_config: a CacheConfig controlling caching behavior. See cache configs.

This is the same as assert_test() in deepeval test run, exposed as a function call instead.

Hyperparameters

Log the model, prompt, and other configuration values with each test run so you can compare runs side-by-side on Confident AI and identify the best combination. Values must be str | int | float or a Prompt:

import deepeval
from deepeval.metrics import TaskCompletionMetric

@deepeval.log_hyperparameters
def hyperparameters():
    return {"model": "gpt-4.1", "system_prompt": "Be concise."}

for golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):
    my_ai_agent(golden.input)

On Confident AI, the logged values become filterable axes for comparing test runs and surfacing the model/prompt configuration that performs best.

In CI/CD

To run single-turn end-to-end evaluations on every PR, swap evaluate() / evals_iterator() for assert_test() inside a pytest parametrized test, then run it with deepeval test run.

test_llm_app.py
import pytest
from deepeval import assert_test
from deepeval.dataset import Golden
from deepeval.metrics import TaskCompletionMetric
from your_app import my_ai_agent  # @observe-instrumented

@pytest.mark.parametrize("golden", dataset.goldens)
def test_llm_app(golden: Golden):
    my_ai_agent(golden.input)
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])

Without tracing, build the LLMTestCase yourself and pass it as test_case:

test_llm_app.py
import pytest
from deepeval import assert_test
from deepeval.dataset import Golden
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from your_app import my_ai_agent

@pytest.mark.parametrize("golden", dataset.goldens)
def test_llm_app(golden: Golden):
    output = my_ai_agent(golden.input)
    test_case = LLMTestCase(input=golden.input, actual_output=output)
    assert_test(test_case=test_case, metrics=[AnswerRelevancyMetric()])

Then run the test file:

deepeval test run test_llm_app.py

See unit testing in CI/CD for assert_test() parameters, YAML pipeline examples, and deepeval test run flags.

FAQs

What is single-turn end-to-end evaluation?
It treats your LLM app as a black box and scores its overall input and output for one atomic interaction, rather than scoring individual internal components. It's the simplest way to get started with evals.

Should I use evals_iterator() with tracing or plain evaluate()?
The recommended approach is evals_iterator() with tracing since it captures rich execution data and scales to nested components later. Use plain evaluate() if you just want to score a list of test cases without instrumenting your app.

Do I need to instrument my app with tracing to run end-to-end evals?
No. You can construct LLMTestCases directly and pass them to evaluate(). Tracing is optional and mainly helps when you want to graduate to component-level evals.

Does deepeval integrate with my agent framework?
Yes. The traced approach works with native integrations for LangChain, LangGraph, LlamaIndex, Pydantic AI, CrewAI, and more, so you can run evals against the framework you already use instead of instrumenting everything by hand.

How do I run these same evals in CI/CD?
Swap evaluate() / evals_iterator() for assert_test() inside a pytest parametrized test and run it with deepeval test run so failing metrics fail the build.

What are hyperparameters used for?
They let you log arbitrary settings like model name, prompt template, and temperature alongside a test run, so you can compare which configuration produced the best scores.

Can my team keep these test runs on the cloud and review results in a UI?
End-to-end runs are local-first. When you're logged into Confident AI (the platform built by the deepeval team), the same run produces a sharable cloud testing report your team can review together, with hyperparameter comparisons and regression tracking over time. No code changes are required, and it is fully optional.
