Single-Turn End-to-End Evaluation
A single-turn end-to-end test scores one input → one output per LLM interaction, captured as an LLMTestCase. This is the right flavor for any LLM application with a "flat" shape — agents treated as a black box, RAG / QA, summarization, classifiers, writing assistants, and so on.
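Concretely, each interaction reduces to one small record: the input, the output your app produced, and optional supporting fields. A plain-data sketch of that shape (an illustrative stand-in, not deepeval's actual `LLMTestCase` class):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SingleTurnCase:  # illustrative stand-in, not deepeval's LLMTestCase
    input: str                               # what went into the app
    actual_output: str                       # what the app produced
    expected_output: Optional[str] = None    # optional reference answer
    retrieval_context: List[str] = field(default_factory=list)  # e.g. RAG chunks

case = SingleTurnCase(
    input="What is DeepEval?",
    actual_output="An open-source LLM evaluation framework.",
)
```

One run of your app fills one such record; metrics then score the record, not the app.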
If you haven't already, read the end-to-end overview for the concepts and how single-turn compares to multi-turn.
There are two ways to run a single-turn E2E test:
| Approach | When to choose it |
|---|---|
| `dataset.evals_iterator()` with `@observe` tracing (recommended) | Your app is (or can be) instrumented with tracing. Test cases are built from traces automatically, and you get per-test-case traces on Confident AI for free. |
| `evaluate(test_cases=...)` | You can't (or don't want to) instrument your app — e.g. a QA engineer evaluating a deployed system. You build `LLMTestCase`s up front and hand them to `evaluate()`. |
For projects you own, prefer evals_iterator() — same code, plus traces, plus a clean upgrade path to component-level evaluation.
Approach 1: evals_iterator() with tracing (recommended)
If your LLM app is (or will be) instrumented with tracing, you don't need to manually build test cases — deepeval will build them from the trace and you get full trace visibility on Confident AI as a bonus. This is the recommended path: it's the same amount of code as Approach 2, you also get traces on every test case, and the same setup is what you'd use for component-level evaluation.
How it works:
- Your traced LLM app emits a trace whenever it runs (via `@observe` or a framework integration). `dataset.evals_iterator()` opens a test run and yields each golden one at a time.
- Inside the loop, you call your traced app with `golden.input`. `deepeval` captures the resulting trace.
- After each iteration, `deepeval` builds an `LLMTestCase` from the trace, applies your metrics, and attaches the scored test case to the trace.
- When the loop finishes, the trace + test case + metric scores upload together as one test run.
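The steps above can be sketched as a plain-Python generator. Everything here is a hypothetical stand-in for illustration, not deepeval's actual internals:

```python
from dataclasses import dataclass, field

@dataclass
class Golden:                     # stand-in golden: just an input
    input: str

@dataclass
class TestRun:                    # stand-in for the uploaded test run
    results: list = field(default_factory=list)

def evals_iterator_sketch(goldens, metrics, test_run):
    for golden in goldens:
        yield golden                      # caller runs the traced app here
        trace = {"input": golden.input}   # deepeval would capture the real trace
        test_case = {"trace": trace}      # test case built from the trace
        scores = [metric(test_case) for metric in metrics]
        test_run.results.append((test_case, scores))
    # loop finished: traces + test cases + scores go up as one test run

def length_metric(tc):            # toy metric for the sketch
    return len(tc["trace"]["input"])

run = TestRun()
for golden in evals_iterator_sketch([Golden("hi"), Golden("hey")], [length_metric], run):
    pass                          # your traced app would be called with golden.input
```

The key property is the `yield`: the iterator hands control back to you for the app call, then resumes to score whatever trace that call produced.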
This same setup also clicks into component-level evaluation for free — once your app is traced, you can attach metrics to individual @observe'd spans in the same loop, and they'll be scored alongside the trace-level metrics.
Instrument/trace your AI
Tracing captures your LLM app's inputs, outputs, and internal spans so deepeval can build test cases from the trace automatically.
Wrap the top-level function of your LLM app with @observe, and call update_current_trace(...) to set the trace-level test case fields:
```python
from deepeval.tracing import observe, update_current_trace

@observe()
def my_ai_agent(query: str) -> str:
    answer = "..."  # call your LLM here
    # explicitly set test case parameters on trace
    update_current_trace(input=query, output=answer)
    return answer
```

See tracing for the full `@observe` and `update_current_trace` surface.
Pass deepeval's CallbackHandler to your chain's invoke method.
```python
from langchain.chat_models import init_chat_model
from deepeval.integrations.langchain import CallbackHandler

def multiply(a: int, b: int) -> int:
    return a * b

llm = init_chat_model("gpt-4.1", model_provider="openai")
llm_with_tools = llm.bind_tools([multiply])

llm_with_tools.invoke(
    "What is 3 * 12?",
    config={"callbacks": [CallbackHandler()]},
)
```

See the LangChain integration for the full surface.
Pass deepeval's CallbackHandler to your agent's invoke method.
```python
from langgraph.prebuilt import create_react_agent
from deepeval.integrations.langchain import CallbackHandler

def get_weather(city: str) -> str:
    return f"It's always sunny in {city}!"

agent = create_react_agent(
    model="openai:gpt-4.1",
    tools=[get_weather],
    prompt="You are a helpful assistant",
)

agent.invoke(
    input={"messages": [{"role": "user", "content": "what is the weather in sf"}]},
    config={"callbacks": [CallbackHandler()]},
)
```

See the LangGraph integration for the full surface.
Drop-in replace from openai import OpenAI with from deepeval.openai import OpenAI. Every chat.completions.create(...), chat.completions.parse(...), and responses.create(...) call becomes an LLM span automatically.
```python
from deepeval.openai import OpenAI

client = OpenAI()
client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
```

See the OpenAI integration for the full surface (including async, streaming, and tool-calling).
Pass DeepEvalInstrumentationSettings() to your Agent's instrument keyword.
```python
from pydantic_ai import Agent
from deepeval.integrations.pydantic_ai import DeepEvalInstrumentationSettings

agent = Agent(
    "openai:gpt-4.1",
    system_prompt="Be concise.",
    instrument=DeepEvalInstrumentationSettings(),
)
agent.run_sync("Greetings, AI Agent.")
```

See the Pydantic AI integration for the full surface.
Call instrument_agentcore() before creating your AgentCore app. The same call also instruments Strands agents running inside AgentCore.
```python
from bedrock_agentcore import BedrockAgentCoreApp
from strands import Agent
from deepeval.integrations.agentcore import instrument_agentcore

instrument_agentcore()

app = BedrockAgentCoreApp()
agent = Agent(model="amazon.nova-lite-v1:0")

@app.entrypoint
def invoke(payload, context):
    return {"result": str(agent(payload.get("prompt")))}
```

See the AgentCore integration for the full surface (including Strands-specific spans).
Drop-in replace from anthropic import Anthropic with from deepeval.anthropic import Anthropic. Every messages.create(...) call becomes an LLM span automatically.
```python
from deepeval.anthropic import Anthropic

client = Anthropic()
client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}],
)
```

See the Anthropic integration for the full surface (including async, streaming, and tool-use).
Register deepeval's event handler against LlamaIndex's instrumentation dispatcher.
```python
import asyncio

from llama_index.llms.openai import OpenAI
from llama_index.core.agent import FunctionAgent
import llama_index.core.instrumentation as instrument
from deepeval.integrations.llama_index import instrument_llama_index

instrument_llama_index(instrument.get_dispatcher())

def multiply(a: float, b: float) -> float:
    return a * b

agent = FunctionAgent(
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful calculator.",
)

asyncio.run(agent.run("What is 8 multiplied by 6?"))
```

See the LlamaIndex integration for the full surface.
Register DeepEvalTracingProcessor once, then build your agent with deepeval's Agent and function_tool shims.
```python
from agents import Runner, add_trace_processor
from deepeval.openai_agents import Agent, DeepEvalTracingProcessor, function_tool

add_trace_processor(DeepEvalTracingProcessor())

@function_tool
def get_weather(city: str) -> str:
    return f"It's always sunny in {city}!"

agent = Agent(
    name="weather_agent",
    instructions="Answer weather questions concisely.",
    tools=[get_weather],
)
Runner.run_sync(agent, "What's the weather in Paris?")
```

See the OpenAI Agents integration for the full surface.
Call instrument_google_adk() once before building your LlmAgent.
```python
import asyncio

from google.adk.agents import LlmAgent
from google.adk.runners import InMemoryRunner
from google.genai import types
from deepeval.integrations.google_adk import instrument_google_adk

instrument_google_adk()

agent = LlmAgent(model="gemini-2.0-flash", name="assistant", instruction="Be concise.")
runner = InMemoryRunner(agent=agent, app_name="deepeval-quickstart")
```

See the Google ADK integration for the full surface.
Call instrument_crewai() once, then build your crew with deepeval's Crew, Agent, and @tool shims.
```python
from crewai import Task
from deepeval.integrations.crewai import instrument_crewai, Crew, Agent

instrument_crewai()

coder = Agent(
    role="Consultant",
    goal="Write a clear, concise explanation.",
    backstory="An expert consultant with a keen eye for software trends.",
)
task = Task(
    description="Explain the latest trends in AI.",
    agent=coder,
    expected_output="A clear and concise explanation.",
)
crew = Crew(agents=[coder], tasks=[task])
crew.kickoff()
```

See the CrewAI integration for the full surface.
Build dataset
Datasets in deepeval store Goldens, which act as precursors to test cases. You loop over goldens at evaluation time, run your LLM app on each, and turn the result into a test case — that way the dataset stays decoupled from any specific app version.
```python
from deepeval.dataset import Golden, EvaluationDataset

goldens = [
    Golden(input="What is your name?"),
    Golden(input="Choose a number between 1 and 100"),
    # ...
]
dataset = EvaluationDataset(goldens=goldens)
```

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My dataset")
```

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_goldens_from_csv_file(
    file_path="example.csv",
    input_col_name="query",
)
```

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_goldens_from_json_file(
    file_path="example.json",
    input_key_name="query",
)
```

You can also generate goldens automatically with the Synthesizer.
Loop with evals_iterator()
Pass your metrics to evals_iterator() and call your traced LLM app inside the loop. Each iteration captures one app run as a trace, then scores that whole trace as one end-to-end test case:
The loop runs asynchronously by default. Wrap each agent call in `asyncio.create_task(...)` and hand the task to `dataset.evaluate(...)` so goldens run concurrently:
```python
import asyncio

from deepeval.metrics import TaskCompletionMetric
from deepeval.dataset import EvaluationDataset
...

for golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):
    # Create async task to run agent; deepeval
    # captures and evaluates the entire trace
    task = asyncio.create_task(a_my_ai_agent(golden.input))
    dataset.evaluate(task)
```

This requires `a_my_ai_agent` to be an `async def` (or otherwise return a coroutine).
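The concurrency here is ordinary asyncio: each golden's agent call becomes a task, and all tasks can be in flight at once. A self-contained sketch with a mock async agent (the agent and golden names are illustrative, not deepeval API):

```python
import asyncio

async def a_my_ai_agent(query: str) -> str:
    # mock async agent standing in for your traced app
    await asyncio.sleep(0.01)     # pretend to call an LLM
    return f"answer to {query}"

async def main():
    goldens = ["What is your name?", "Choose a number between 1 and 100"]
    # one task per golden; they all run concurrently
    tasks = [asyncio.create_task(a_my_ai_agent(g)) for g in goldens]
    return await asyncio.gather(*tasks)

results = asyncio.run(main())
```

With N goldens and a slow provider, this is the difference between N sequential round-trips and roughly one.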
Pass AsyncConfig(run_async=False) to score metrics one at a time. Useful for debugging, rate-limited providers, or anywhere asyncio gets in the way (e.g. some Jupyter setups).
```python
from deepeval.evaluate import AsyncConfig
from deepeval.metrics import TaskCompletionMetric
from deepeval.dataset import EvaluationDataset
...

for golden in dataset.evals_iterator(
    metrics=[TaskCompletionMetric()],
    async_config=AsyncConfig(run_async=False),
):
    my_ai_agent(golden.input)
```

There are SIX optional parameters on `evals_iterator()`:
- [Optional] `metrics`: a list of `BaseMetric`s applied at the trace (end-to-end) level.
- [Optional] `identifier`: a string label for this test run on Confident AI.
- [Optional] `async_config`: an `AsyncConfig` controlling concurrency. See async configs.
- [Optional] `display_config`: a `DisplayConfig` controlling console output. See display configs.
- [Optional] `error_config`: an `ErrorConfig` controlling error handling. See error configs.
- [Optional] `cache_config`: a `CacheConfig` controlling caching. See cache configs.
Note that passing metrics=[...] to evals_iterator() attaches them at the trace level — i.e. end-to-end. To grade individual components (the retriever, a tool call, an inner LLM call), attach metrics on the @observe(metrics=[...]) decorator of that span instead — that's component-level evaluation, not end-to-end.
If you're logged in to Confident AI via deepeval login, you'll also get to see full traces in testing reports on the platform:
Approach 2: evaluate()
Use this when you can't (or don't want to) instrument your app — for example a QA engineer testing a deployed system, or a quick one-off eval where adding tracing is overkill. You build a list of LLMTestCases up front from inputs and outputs you've already collected, pick metrics, and call evaluate().
How it works:
- You build a list of `LLMTestCase`s yourself by looping over goldens and calling your LLM app.
- You hand the test cases and metrics to `evaluate()` in a single call.
- `deepeval` runs every metric on every test case (concurrently by default) and rolls the results into a test run.
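The scoring step above is a cartesian product: every metric runs over every test case, and the scores roll up into one run. A toy sketch with stand-in test cases and metrics (nothing here is deepeval API):

```python
# toy test cases: plain dicts standing in for LLMTestCases
test_cases = [
    {"input": "hi", "actual_output": "hello"},
    {"input": "2+2?", "actual_output": "4"},
]

def length_metric(tc):
    # toy metric: longer answers score higher, capped at 1.0
    return min(len(tc["actual_output"]) / 10, 1.0)

def echo_metric(tc):
    # toy metric: did the output literally repeat the input?
    return 1.0 if tc["input"] in tc["actual_output"] else 0.0

metrics = [length_metric, echo_metric]
# every metric on every test case, rolled into one "test run"
test_run = [[m(tc) for m in metrics] for tc in test_cases]
```

Real metrics are usually LLM-judged and run concurrently, but the shape of the result (cases × metrics) is the same.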
Your LLM app and deepeval stay completely decoupled — evaluate() only sees the data you pass to it. That's why this approach has no tracing dependency.
Build dataset
Same as Approach 1 — wrap your goldens in an EvaluationDataset. Pick whichever source fits where your goldens live today:
```python
from deepeval.dataset import Golden, EvaluationDataset

goldens = [
    Golden(input="What is your name?"),
    Golden(input="Choose a number between 1 and 100"),
    # ...
]
dataset = EvaluationDataset(goldens=goldens)
```

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")
```

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_goldens_from_csv_file(
    file_path="example.csv",
    input_col_name="query",
)
```

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_goldens_from_json_file(
    file_path="example.json",
    input_key_name="query",
)
```

To persist a dataset (push to Confident AI, save as CSV/JSON, version across runs), see the datasets page.
Construct test cases
Loop over your goldens, call your LLM app, and wrap each result in an LLMTestCase:
```python
from your_app import your_llm_app  # replace with your LLM app
from deepeval.test_case import LLMTestCase
...

for golden in dataset.goldens:
    answer, retrieved_chunks = your_llm_app(golden.input)
    dataset.add_test_case(
        LLMTestCase(
            input=golden.input,
            actual_output=answer,
            retrieval_context=retrieved_chunks,
        )
    )
```

Run evaluate()
Now pick the metrics you want to grade your application on, and pass both test_cases and metrics to evaluate().
```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
...

evaluate(
    test_cases=dataset.test_cases,
    metrics=[AnswerRelevancyMetric(), FaithfulnessMetric()],
)
```

There are TWO mandatory and FIVE optional parameters when calling `evaluate()` for end-to-end evaluation:
- `test_cases`: a list of `LLMTestCase`s OR `ConversationalTestCase`s, or an `EvaluationDataset`. You cannot mix `LLMTestCase`s and `ConversationalTestCase`s in the same test run.
- `metrics`: a list of metrics of type `BaseMetric`.
- [Optional] `identifier`: a string label for this test run on Confident AI.
- [Optional] `async_config`: an `AsyncConfig` controlling concurrency. See async configs.
- [Optional] `display_config`: a `DisplayConfig` controlling console output. See display configs.
- [Optional] `error_config`: an `ErrorConfig` controlling how errors are handled. See error configs.
- [Optional] `cache_config`: a `CacheConfig` controlling caching behavior. See cache configs.
This is the same as assert_test() in deepeval test run, exposed as a function call instead.
Hyperparameters
Log the model, prompt, and other configuration values with each test run so you can compare runs side-by-side on Confident AI and identify the best combination. Values must be str | int | float or a Prompt:
```python
import deepeval
from deepeval.metrics import TaskCompletionMetric

@deepeval.log_hyperparameters
def hyperparameters():
    return {"model": "gpt-4.1", "system_prompt": "Be concise."}

for golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):
    my_ai_agent(golden.input)
```

On Confident AI, the logged values become filterable axes for comparing test runs and surfacing the model/prompt configuration that performs best:
In CI/CD
To run single-turn end-to-end evaluations on every PR, swap evaluate() / evals_iterator() for assert_test() inside a pytest parametrized test, then run it with deepeval test run.
```python
import pytest
from deepeval import assert_test
from deepeval.dataset import Golden
from deepeval.metrics import TaskCompletionMetric
from your_app import my_ai_agent  # @observe-instrumented

@pytest.mark.parametrize("golden", dataset.goldens)
def test_llm_app(golden: Golden):
    my_ai_agent(golden.input)
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])
```

```python
import pytest
from deepeval import assert_test
from deepeval.dataset import Golden
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from your_app import my_ai_agent

@pytest.mark.parametrize("golden", dataset.goldens)
def test_llm_app(golden: Golden):
    output = my_ai_agent(golden.input)
    test_case = LLMTestCase(input=golden.input, actual_output=output)
    assert_test(test_case=test_case, metrics=[AnswerRelevancyMetric()])
```

```shell
deepeval test run test_llm_app.py
```

See unit testing in CI/CD for `assert_test()` parameters, YAML pipeline examples, and `deepeval test run` flags.