Multi-Turn End-to-End Evaluation
Multi-turn end-to-end evaluation grades whole conversations, not single exchanges. Each test case is a ConversationalTestCase and each golden is a ConversationalGolden describing a scenario, an expected outcome, and who the user is.
If you haven't already, read the end-to-end overview for the concepts and how multi-turn compares to single-turn.
How Multi-Turn E2E Eval Works
A multi-turn test run is built in two phases: simulation (synthetic user vs. your chatbot) and evaluation (metrics applied to the resulting conversations).
- You wrap your chatbot in a model_callback (sync or async) that returns the next assistant Turn.
- You build a dataset of ConversationalGoldens — each describes the scenario, expected outcome, and persona of the simulated user.
- You hand the goldens + callback to a ConversationSimulator. It plays a synthetic user against your chatbot until the scenario plays out, producing one ConversationalTestCase per golden.
- You pass the test cases + multi-turn metrics to evaluate(), which scores them and rolls the results into a test run.
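Put together, the whole flow is only a few lines. The sketch below assumes a your_chatbot coroutine exists and reuses the names from the steps that follow; each step is expanded in the guide below:

from typing import List

from deepeval import evaluate
from deepeval.conversation_simulator import ConversationSimulator
from deepeval.dataset import ConversationalGolden, EvaluationDataset
from deepeval.metrics import TurnRelevancyMetric
from deepeval.test_case import Turn

# 1. Wrap your chatbot in a callback that returns the next assistant Turn
async def model_callback(input: str, turns: List[Turn], thread_id: str) -> Turn:
    response = await your_chatbot(input, turns, thread_id)  # your_chatbot is an assumed stand-in for your app
    return Turn(role="assistant", content=response)

# 2. Describe who the simulated user is and what success looks like
dataset = EvaluationDataset(goldens=[
    ConversationalGolden(
        scenario="Andy Byron wants to purchase a VIP ticket to a Coldplay concert.",
        expected_outcome="Successful purchase of a ticket.",
        user_description="Andy Byron is the CEO of Astronomer.",
    ),
])

# 3. Simulate one conversation per golden
simulator = ConversationSimulator(model_callback=model_callback)
test_cases = simulator.simulate(conversational_goldens=dataset.goldens, max_user_simulations=10)

# 4. Score the simulated conversations with multi-turn metrics
evaluate(test_cases=test_cases, metrics=[TurnRelevancyMetric()])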
Step-by-Step Guide
Wrap your chatbot in a callback
The ConversationSimulator needs a way to ask your chatbot for its next reply, given the conversation so far. You provide that as a model_callback — either a regular function or an async one; the simulator detects which and dispatches accordingly. The examples below use async def because most modern chat clients are async, but plain def works just as well:
from typing import List
from deepeval.test_case import Turn

async def model_callback(input: str, turns: List[Turn], thread_id: str) -> Turn:
    response = await your_chatbot(input, turns, thread_id)
    return Turn(role="assistant", content=response)

With the OpenAI SDK:

from typing import List
from deepeval.test_case import Turn
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def model_callback(input: str, turns: List[Turn]) -> Turn:
    messages = [
        {"role": "system", "content": "You are a ticket purchasing assistant"},
        *[{"role": t.role, "content": t.content} for t in turns],
        {"role": "user", "content": input},
    ]
    response = await client.chat.completions.create(model="gpt-4.1", messages=messages)
    return Turn(role="assistant", content=response.choices[0].message.content)

With LangChain:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from deepeval.test_case import Turn

store = {}
llm = ChatOpenAI(model="gpt-4")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a ticket purchasing assistant."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])
chain_with_history = RunnableWithMessageHistory(
    prompt | llm,
    lambda session_id: store.setdefault(session_id, ChatMessageHistory()),
    input_messages_key="input",
    history_messages_key="history",
)

async def model_callback(input: str, thread_id: str) -> Turn:
    response = chain_with_history.invoke(
        {"input": input},
        config={"configurable": {"session_id": thread_id}},
    )
    return Turn(role="assistant", content=response.content)

With LlamaIndex:

from llama_index.core.storage.chat_store import SimpleChatStore
from llama_index.llms.openai import OpenAI
from llama_index.core.chat_engine import SimpleChatEngine
from llama_index.core.memory import ChatMemoryBuffer
from deepeval.test_case import Turn

chat_store = SimpleChatStore()
llm = OpenAI(model="gpt-4")

async def model_callback(input: str, thread_id: str) -> Turn:
    memory = ChatMemoryBuffer.from_defaults(chat_store=chat_store, chat_store_key=thread_id)
    chat_engine = SimpleChatEngine.from_defaults(llm=llm, memory=memory)
    response = chat_engine.chat(input)
    return Turn(role="assistant", content=response.response)

With the OpenAI Agents SDK:

from agents import Agent, Runner, SQLiteSession
from deepeval.test_case import Turn

sessions = {}
agent = Agent(name="Test Assistant", instructions="You are a helpful assistant that answers questions concisely.")

async def model_callback(input: str, thread_id: str) -> Turn:
    if thread_id not in sessions:
        sessions[thread_id] = SQLiteSession(thread_id)
    session = sessions[thread_id]
    result = await Runner.run(agent, input, session=session)
    return Turn(role="assistant", content=result.final_output)

With Pydantic AI:

from typing import List
from datetime import datetime
from pydantic_ai import Agent
from pydantic_ai.messages import ModelRequest, ModelResponse, UserPromptPart, TextPart
from deepeval.test_case import Turn

agent = Agent('openai:gpt-4', system_prompt="You are a helpful assistant that answers questions concisely.")

async def model_callback(input: str, turns: List[Turn]) -> Turn:
    message_history = []
    for turn in turns:
        if turn.role == "user":
            message_history.append(ModelRequest(parts=[UserPromptPart(content=turn.content, timestamp=datetime.now())], kind='request'))
        elif turn.role == "assistant":
            message_history.append(ModelResponse(parts=[TextPart(content=turn.content)], model_name='gpt-4', timestamp=datetime.now(), kind='response'))
    result = await agent.run(input, message_history=message_history)
    return Turn(role="assistant", content=result.output)

See Conversation Simulator → Model Callback for the full callback contract, including custom argument injection.
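If your chatbot is synchronous, the same callback can be a plain def and the simulator will dispatch it without awaiting. A minimal sketch, again assuming a your_chatbot function:

from typing import List
from deepeval.test_case import Turn

# Synchronous variant: the simulator detects a plain function and calls it directly
def model_callback(input: str, turns: List[Turn], thread_id: str) -> Turn:
    response = your_chatbot(input, turns, thread_id)  # your_chatbot is assumed to be synchronous here
    return Turn(role="assistant", content=response)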
Build dataset
A ConversationalGolden describes the situation the simulated user is in, what success looks like, and who they are. Wrap a list of them in an EvaluationDataset so the simulator can iterate. Pick whichever source fits where your goldens live today:
from deepeval.dataset import ConversationalGolden, EvaluationDataset

goldens = [
    ConversationalGolden(
        scenario="Andy Byron wants to purchase a VIP ticket to a Coldplay concert.",
        expected_outcome="Successful purchase of a ticket.",
        user_description="Andy Byron is the CEO of Astronomer.",
    ),
    # ...
]

dataset = EvaluationDataset(goldens=goldens)

The dataset lives only for this run — no push, no save. Perfect for quickstarts and one-off evaluations.
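If you do want to reuse these goldens across runs, you can push them to Confident AI once and pull them by alias later. A minimal sketch, assuming your environment is already logged in to Confident AI (the alias is just an example):

# Persist the goldens on Confident AI so future runs can pull them by alias
dataset.push(alias="My multi-turn dataset")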
Pull from Confident AI:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My multi-turn dataset")

From CSV:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_goldens_from_csv_file(
    file_path="conversations.csv",
    scenario_col_name="scenario",
    expected_outcome_col_name="expected_outcome",
    user_description_col_name="user_description",
)

From JSON:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_goldens_from_json_file(
    file_path="conversations.json",
    scenario_key_name="scenario",
    expected_outcome_key_name="expected_outcome",
    user_description_key_name="user_description",
)

Simulate turns
Hand the goldens and the callback to a ConversationSimulator to produce a list of ConversationalTestCases:
from deepeval.conversation_simulator import ConversationSimulator

simulator = ConversationSimulator(model_callback=model_callback)
conversational_test_cases = simulator.simulate(
    conversational_goldens=dataset.goldens,
    max_user_simulations=10,
)

The simulator exposes additional configuration beyond what fits here — see stopping logic, custom templates, and lifecycle hooks for the full surface.
The simulator carries scenario, expected_outcome, and user_description over from the golden, and fills in turns:
ConversationalTestCase(
    scenario="Andy Byron wants to purchase a VIP ticket to a Coldplay concert.",
    expected_outcome="Successful purchase of a ticket.",
    user_description="Andy Byron is the CEO of Astronomer.",
    turns=[
        Turn(role="user", content="Hi, I'd like to buy a VIP ticket for the Coldplay show."),
        Turn(role="assistant", content="Sure — which date and city are you looking for?"),
        Turn(role="user", content="The November 12 show in NYC."),
        Turn(role="assistant", content="Got it. That'll be $850. Shall I proceed?"),
        # ...
    ],
)

Run evaluate()
Pass the simulated test cases and your multi-turn metrics to evaluate():
Concurrent (default). Metrics dispatch concurrently across conversations for the fastest run.

from deepeval import evaluate
from deepeval.metrics import TurnRelevancyMetric

evaluate(
    test_cases=conversational_test_cases,
    metrics=[TurnRelevancyMetric()],
)

Sequential. Pass AsyncConfig(run_async=False) to score conversations one at a time. Useful for debugging, rate-limited providers, or anywhere asyncio gets in the way (e.g. some Jupyter setups).
from deepeval import evaluate
from deepeval.evaluate import AsyncConfig
from deepeval.metrics import TurnRelevancyMetric

evaluate(
    test_cases=conversational_test_cases,
    metrics=[TurnRelevancyMetric()],
    async_config=AsyncConfig(run_async=False),
)

There are TWO mandatory and FIVE optional parameters when calling evaluate() for multi-turn end-to-end evaluation:
- test_cases: a list of ConversationalTestCases (or an EvaluationDataset). You cannot mix LLMTestCases and ConversationalTestCases in the same test run.
- metrics: a list of metrics of type BaseConversationalMetric. See the multi-turn metrics for the full list (e.g. TurnRelevancyMetric, KnowledgeRetentionMetric, RoleAdherenceMetric, ConversationCompletenessMetric).
- [Optional] identifier: a string label for this test run (see the example after this list).
- [Optional] async_config: an AsyncConfig controlling concurrency. See async configs.
- [Optional] display_config: a DisplayConfig controlling console output. See display configs.
- [Optional] error_config: an ErrorConfig controlling error handling. See error configs.
- [Optional] cache_config: a CacheConfig controlling caching. See cache configs.
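For example, a minimal sketch that labels the run and tolerates individual metric failures. The identifier parameter comes straight from the list above; ErrorConfig(ignore_errors=True) and its import path are assumptions, so check the error configs page for the exact field names in your deepeval version:

from deepeval import evaluate
from deepeval.evaluate import ErrorConfig
from deepeval.metrics import TurnRelevancyMetric

evaluate(
    test_cases=conversational_test_cases,
    metrics=[TurnRelevancyMetric()],
    identifier="ticket-bot-v2",  # hypothetical label for this test run
    error_config=ErrorConfig(ignore_errors=True),  # assumption: field name may differ across versions
)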
Note that simulation and evaluation have separate concurrency controls — ConversationSimulator(max_concurrent=...) decides how many conversations are simulated in parallel; AsyncConfig only affects how those finished conversations are scored.
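A sketch of the two knobs side by side (max_concurrent on AsyncConfig is an assumption; check the async configs page for the exact fields in your version):

from deepeval import evaluate
from deepeval.evaluate import AsyncConfig
from deepeval.metrics import TurnRelevancyMetric
from deepeval.conversation_simulator import ConversationSimulator

# Simulation-side concurrency: how many conversations are simulated in parallel
# (model_callback and dataset are defined as in the steps above)
simulator = ConversationSimulator(model_callback=model_callback, max_concurrent=5)
test_cases = simulator.simulate(conversational_goldens=dataset.goldens, max_user_simulations=10)

# Evaluation-side concurrency: how the finished conversations are scored
evaluate(
    test_cases=test_cases,
    metrics=[TurnRelevancyMetric()],
    async_config=AsyncConfig(run_async=True, max_concurrent=20),  # max_concurrent here is an assumption
)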
We highly recommend setting up Confident AI with your deepeval evaluations to get professional test reports and observe your application's performance over time.
Hyperparameters
Log the model, prompt, and other configuration values with each test run so you can compare runs side-by-side on Confident AI and identify the best combination. Values must be str | int | float or a Prompt. Pass them directly to evaluate():
evaluate(
    test_cases=conversational_test_cases,
    metrics=[TurnRelevancyMetric()],
    hyperparameters={"model": "gpt-4.1", "system_prompt": "Be concise."},
)

On Confident AI, the logged values become filterable axes for comparing test runs and surfacing the configuration that performs best.
In CI/CD
To run multi-turn end-to-end evaluations on every PR, simulate conversations once at module load, then assert_test() each one inside a pytest parametrized test:
import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import ConversationalTestCase
from deepeval.metrics import TurnRelevancyMetric
from deepeval.conversation_simulator import ConversationSimulator
from your_app import model_callback

# Pull the goldens and simulate conversations once at module load
dataset = EvaluationDataset()
dataset.pull(alias="My multi-turn dataset")

simulator = ConversationSimulator(model_callback=model_callback)
test_cases = simulator.simulate(conversational_goldens=dataset.goldens, max_user_simulations=10)

@pytest.mark.parametrize("test_case", test_cases)
def test_chatbot(test_case: ConversationalTestCase):
    assert_test(test_case=test_case, metrics=[TurnRelevancyMetric()])

Run the file with the deepeval CLI:

deepeval test run test_chatbot.py

See unit testing in CI/CD for assert_test() parameters, YAML pipeline examples, and deepeval test run flags.