Multi-Turn End-to-End Evaluation

Multi-turn end-to-end evaluation grades whole conversations, not single exchanges. Each test case is a ConversationalTestCase and each golden is a ConversationalGolden describing a scenario, an expected outcome, and who the user is.

If you haven't already, read the end-to-end overview for the concepts and how multi-turn compares to single-turn.

How Multi-Turn E2E Eval Works

A multi-turn test run is built in two phases: simulation (synthetic user vs. your chatbot) and evaluation (metrics applied to the resulting conversations).

You wrap your chatbot in a model_callback (sync or async) that returns the next assistant Turn.
You build a dataset of ConversationalGoldens — each describes the scenario, expected outcome, and persona of the simulated user.
You hand the goldens + callback to a ConversationSimulator. It plays a synthetic user against your chatbot until the scenario plays out, producing one ConversationalTestCase per golden.
You pass the test cases + multi-turn metrics to evaluate(), which scores them and rolls the results into a test run.
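The four steps above can be pictured with a small, dependency-free toy. Everything in it is a stand-in rather than deepeval's implementation: `Golden`, `Conversation`, `simulate`, and `score` are hypothetical names, the "simulated user" just replays scripted lines, and the "metric" is a trivial check. It only illustrates that simulation yields one conversation per golden and that evaluation then scores the finished conversations.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Golden:  # hypothetical stand-in for ConversationalGolden
    scenario: str
    expected_outcome: str
    user_lines: List[str]  # scripted user messages; a real simulator generates these


@dataclass
class Conversation:  # hypothetical stand-in for ConversationalTestCase
    scenario: str
    expected_outcome: str
    turns: List[Tuple[str, str]] = field(default_factory=list)  # (role, content)


def simulate(goldens: List[Golden], callback: Callable, max_user_turns: int = 10):
    """Phase 1: play each golden's user lines against the chatbot callback."""
    conversations = []
    for golden in goldens:
        convo = Conversation(golden.scenario, golden.expected_outcome)
        for line in golden.user_lines[:max_user_turns]:
            reply = callback(line, list(convo.turns))  # history so far, then new input
            convo.turns.append(("user", line))
            convo.turns.append(("assistant", reply))
        conversations.append(convo)
    return conversations


def score(convo: Conversation) -> float:
    """Phase 2 (toy metric): fraction of assistant turns that are non-empty."""
    replies = [content for role, content in convo.turns if role == "assistant"]
    if not replies:
        return 0.0
    return sum(1 for content in replies if content.strip()) / len(replies)


def echo_bot(user_input: str, history) -> str:
    # Stand-in chatbot: repeats the user's message back.
    return f"You said: {user_input}"


goldens = [Golden("Buy a ticket", "Ticket purchased", ["Hi", "One VIP ticket, please"])]
conversations = simulate(goldens, echo_bot)
scores = [score(c) for c in conversations]
```

In the real workflow the scripted lines are replaced by an LLM-driven user and the toy `score` by multi-turn metrics, but the data still flows the same way: goldens in, conversations out, scores last.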

Step-by-Step Guide

Wrap your chatbot in a callback

The ConversationSimulator needs a way to ask your chatbot for its next reply, given the conversation so far. You provide that as a model_callback — either a regular function or an async one; the simulator detects which and dispatches accordingly. The examples below use async def because most modern chat clients are async, but plain def works just as well:

main.py

from typing import List
from deepeval.test_case import Turn

async def model_callback(input: str, turns: List[Turn], thread_id: str) -> Turn:
    response = await your_chatbot(input, turns, thread_id)
    return Turn(role="assistant", content=response)

main.py

from typing import List
from deepeval.test_case import Turn
from openai import AsyncOpenAI

# Use the async client so the completion call below can be awaited.
client = AsyncOpenAI()

async def model_callback(input: str, turns: List[Turn]) -> Turn:
    messages = [
        {"role": "system", "content": "You are a ticket purchasing assistant"},
        *[{"role": t.role, "content": t.content} for t in turns],
        {"role": "user", "content": input},
    ]
    response = await client.chat.completions.create(model="gpt-4.1", messages=messages)
    return Turn(role="assistant", content=response.choices[0].message.content)

main.py

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from deepeval.test_case import Turn

store = {}
llm = ChatOpenAI(model="gpt-4")
prompt = ChatPromptTemplate.from_messages([("system", "You are a ticket purchasing assistant."), MessagesPlaceholder(variable_name="history"), ("human", "{input}")])
chain_with_history = RunnableWithMessageHistory(prompt | llm, lambda session_id: store.setdefault(session_id, ChatMessageHistory()), input_messages_key="input", history_messages_key="history")

async def model_callback(input: str, thread_id: str) -> Turn:
    # ainvoke keeps the async callback from blocking the event loop.
    response = await chain_with_history.ainvoke(
        {"input": input},
        config={"configurable": {"session_id": thread_id}},
    )
    return Turn(role="assistant", content=response.content)

main.py

from llama_index.core.storage.chat_store import SimpleChatStore
from llama_index.llms.openai import OpenAI
from llama_index.core.chat_engine import SimpleChatEngine
from llama_index.core.memory import ChatMemoryBuffer
from deepeval.test_case import Turn

chat_store = SimpleChatStore()
llm = OpenAI(model="gpt-4")

async def model_callback(input: str, thread_id: str) -> Turn:
    memory = ChatMemoryBuffer.from_defaults(chat_store=chat_store, chat_store_key=thread_id)
    chat_engine = SimpleChatEngine.from_defaults(llm=llm, memory=memory)
    # achat is the awaitable counterpart of chat, so the event loop is not blocked.
    response = await chat_engine.achat(input)
    return Turn(role="assistant", content=response.response)

main.py

from agents import Agent, Runner, SQLiteSession
from deepeval.test_case import Turn

sessions = {}
agent = Agent(name="Test Assistant", instructions="You are a helpful assistant that answers questions concisely.")

async def model_callback(input: str, thread_id: str) -> Turn:
    if thread_id not in sessions:
        sessions[thread_id] = SQLiteSession(thread_id)
    session = sessions[thread_id]
    result = await Runner.run(agent, input, session=session)
    return Turn(role="assistant", content=result.final_output)

main.py

from typing import List
from datetime import datetime
from pydantic_ai import Agent
from pydantic_ai.messages import ModelRequest, ModelResponse, UserPromptPart, TextPart
from deepeval.test_case import Turn

agent = Agent('openai:gpt-4', system_prompt="You are a helpful assistant that answers questions concisely.")

async def model_callback(input: str, turns: List[Turn]) -> Turn:
    message_history = []
    for turn in turns:
        if turn.role == "user":
            message_history.append(ModelRequest(parts=[UserPromptPart(content=turn.content, timestamp=datetime.now())], kind='request'))
        elif turn.role == "assistant":
            message_history.append(ModelResponse(parts=[TextPart(content=turn.content)], model_name='gpt-4', timestamp=datetime.now(), kind='response'))
    result = await agent.run(input, message_history=message_history)
    return Turn(role="assistant", content=result.output)

See Conversation Simulator → Model Callback for the full callback contract, including custom argument injection.
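How the simulator tells a sync callback from an async one is not shown on this page. One common way to do it in Python is `inspect.iscoroutinefunction`, sketched below. `call_callback` is a hypothetical helper, not part of deepeval, and the two callbacks return plain strings instead of `Turn` objects so the snippet runs without any dependencies.

```python
import asyncio
import inspect


async def call_callback(callback, *args, **kwargs):
    """Await coroutine functions directly; run plain functions in a worker thread."""
    if inspect.iscoroutinefunction(callback):
        return await callback(*args, **kwargs)
    # A plain function might block, so keep it off the event loop.
    return await asyncio.to_thread(callback, *args, **kwargs)


def sync_callback(input: str) -> str:
    return f"sync: {input}"


async def async_callback(input: str) -> str:
    return f"async: {input}"


async def main():
    return [
        await call_callback(sync_callback, "hi"),
        await call_callback(async_callback, "hi"),
    ]


results = asyncio.run(main())
```

Either style of callback ends up producing a reply through the same entry point, which is why the choice between `def` and `async def` can be left to whatever your chat client already uses.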

Build dataset

A ConversationalGolden describes the situation the simulated user is in, what success looks like, and who they are. Wrap a list of them in an EvaluationDataset so the simulator can iterate. Pick whichever source fits where your goldens live today:

from deepeval.dataset import ConversationalGolden, EvaluationDataset

goldens = [
    ConversationalGolden(
        scenario="Andy Byron wants to purchase a VIP ticket to a Coldplay concert.",
        expected_outcome="Successful purchase of a ticket.",
        user_description="Andy Byron is the CEO of Astronomer.",
    ),
    # ...
]

dataset = EvaluationDataset(goldens=goldens)

A dataset built in code like this lives only for the current run: nothing is pushed or saved, which suits quickstarts and one-off evaluations.

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My multi-turn dataset")

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_goldens_from_csv_file(
    file_path="conversations.csv",
    scenario_col_name="scenario",
    expected_outcome_col_name="expected_outcome",
    user_description_col_name="user_description",
)

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_goldens_from_json_file(
    file_path="conversations.json",
    scenario_key_name="scenario",
    expected_outcome_key_name="expected_outcome",
    user_description_key_name="user_description",
)
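If you do not yet have files in the shape the two file loaders above read, the sketch below writes a minimal `conversations.csv` and `conversations.json` using only the standard library. The column and key names mirror the ones passed to the loaders. The JSON layout (a top-level list of objects) is an assumption, so check it against the dataset documentation before relying on it.

```python
import csv
import json

FIELDS = ["scenario", "expected_outcome", "user_description"]

records = [
    {
        "scenario": "Andy Byron wants to purchase a VIP ticket to a Coldplay concert.",
        "expected_outcome": "Successful purchase of a ticket.",
        "user_description": "Andy Byron is the CEO of Astronomer.",
    },
]

# CSV: one header row whose names match the *_col_name arguments.
with open("conversations.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)

# JSON: assumed to be a list of objects keyed by the *_key_name arguments.
with open("conversations.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)

# Read both files back to confirm they round-trip.
with open("conversations.csv", newline="", encoding="utf-8") as f:
    csv_rows = list(csv.DictReader(f))
with open("conversations.json", encoding="utf-8") as f:
    json_rows = json.load(f)
```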

Simulate turns

Hand the goldens and the callback to a ConversationSimulator to produce a list of ConversationalTestCases:

main.py

from deepeval.conversation_simulator import ConversationSimulator

simulator = ConversationSimulator(model_callback=model_callback)
conversational_test_cases = simulator.simulate(
    conversational_goldens=dataset.goldens,
    max_user_simulations=10,
)

The simulator exposes additional configuration beyond what fits here — see stopping logic, custom templates, and lifecycle hooks for the full surface.


The simulator carries scenario, expected_outcome, and user_description over from the golden, and fills in turns:

ConversationalTestCase(
    scenario="Andy Byron wants to purchase a VIP ticket to a Coldplay concert.",
    expected_outcome="Successful purchase of a ticket.",
    user_description="Andy Byron is the CEO of Astronomer.",
    turns=[
        Turn(role="user", content="Hi, I'd like to buy a VIP ticket for the Coldplay show."),
        Turn(role="assistant", content="Sure — which date and city are you looking for?"),
        Turn(role="user", content="The November 12 show in NYC."),
        Turn(role="assistant", content="Got it. That'll be $850. Shall I proceed?"),
        # ...
    ],
)
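To eyeball simulated conversations before scoring them, a small helper can render the turns as a transcript. This sketch uses a hypothetical `SimpleTurn` stand-in that exposes the same `role` and `content` attributes as the `Turn` objects above, so it runs without deepeval installed. `format_transcript` is likewise an illustrative name.

```python
from typing import Iterable, NamedTuple


class SimpleTurn(NamedTuple):  # hypothetical stand-in for deepeval's Turn
    role: str
    content: str


def format_transcript(turns: Iterable) -> str:
    """Render any objects with .role and .content as one line per turn."""
    return "\n".join(f"[{turn.role}] {turn.content}" for turn in turns)


turns = [
    SimpleTurn("user", "Hi, I'd like to buy a VIP ticket for the Coldplay show."),
    SimpleTurn("assistant", "Sure, which date and city are you looking for?"),
]
transcript = format_transcript(turns)
print(transcript)
```

Because the helper only reads `role` and `content`, it should also accept the `turns` list of a simulated test case, assuming those attributes are present as shown in the example above.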

Run evaluate()

Pass the simulated test cases and your multi-turn metrics to evaluate():

By default, metrics are dispatched concurrently across conversations, which gives the fastest run.

main.py

from deepeval import evaluate
from deepeval.metrics import TurnRelevancyMetric

evaluate(
    test_cases=conversational_test_cases,
    metrics=[TurnRelevancyMetric()],
)

Pass AsyncConfig(run_async=False) to score conversations one at a time. Useful for debugging, rate-limited providers, or anywhere asyncio gets in the way (e.g. some Jupyter setups).

main.py

from deepeval import evaluate
from deepeval.evaluate import AsyncConfig
from deepeval.metrics import TurnRelevancyMetric

evaluate(
    test_cases=conversational_test_cases,
    metrics=[TurnRelevancyMetric()],
    async_config=AsyncConfig(run_async=False),
)

There are TWO mandatory and FIVE optional parameters when calling evaluate() for multi-turn end-to-end evaluation:

test_cases: a list of ConversationalTestCases (or an EvaluationDataset). You cannot mix LLMTestCases and ConversationalTestCases in the same test run.
metrics: a list of metrics of type BaseConversationalMetric. See the multi-turn metrics for the full list (e.g. TurnRelevancyMetric, KnowledgeRetentionMetric, RoleAdherenceMetric, ConversationCompletenessMetric).
[Optional] identifier: a string label for this test run.
[Optional] async_config: an AsyncConfig controlling concurrency. See async configs.
[Optional] display_config: a DisplayConfig controlling console output. See display configs.
[Optional] error_config: an ErrorConfig controlling error handling. See error configs.
[Optional] cache_config: a CacheConfig controlling caching. See cache configs.

Note that simulation and evaluation have separate concurrency controls — ConversationSimulator(max_concurrent=...) decides how many conversations are simulated in parallel; AsyncConfig only affects how those finished conversations are scored.
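As a rough picture of what a concurrency cap such as `max_concurrent` means, the sketch below bounds parallel work with an `asyncio.Semaphore`. This is a generic asyncio pattern, not deepeval's internals. `run_bounded` is a hypothetical helper, and the `sleep` stands in for simulating one conversation.

```python
import asyncio


async def run_bounded(jobs, max_concurrent: int):
    """Run one worker per job, with at most max_concurrent inside the semaphore."""
    semaphore = asyncio.Semaphore(max_concurrent)
    running = 0
    peak = 0  # highest number of workers observed running at once

    async def worker(job):
        nonlocal running, peak
        async with semaphore:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # placeholder for one simulated conversation
            running -= 1
            return job * 2

    results = await asyncio.gather(*(worker(job) for job in jobs))
    return results, peak


results, peak = asyncio.run(run_bounded(range(10), max_concurrent=3))
```

All ten jobs finish and `gather` returns their results in submission order, but no more than three run at any moment. A lower cap is the usual lever when a provider rate-limits the chatbot under test.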

We highly recommend setting up Confident AI with your deepeval evaluations to get professional test reports and observe your application's performance over time:

Test Reports After Running Evals on Confident AI

Hyperparameters

Log the model, prompt, and other configuration values with each test run so you can compare runs side-by-side on Confident AI and identify the best combination. Values must be str | int | float or a Prompt. Pass them directly to evaluate():

evaluate(
    test_cases=conversational_test_cases,
    metrics=[TurnRelevancyMetric()],
    hyperparameters={"model": "gpt-4.1", "system_prompt": "Be concise."},
)

On Confident AI, the logged values become filterable axes for comparing test runs and surfacing the configuration that performs best.
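Real configurations often contain lists or nested dicts, which fall outside the `str | int | float` constraint. One possible workaround, sketched here with a hypothetical `to_loggable` helper, is to serialize such values to JSON strings before logging them. `Prompt` objects are already accepted, so pass those directly instead of running them through this helper.

```python
import json


def to_loggable(config: dict) -> dict:
    """Keep str/int/float values as-is and JSON-encode everything else."""
    loggable = {}
    for key, value in config.items():
        if isinstance(value, (str, int, float)):
            loggable[key] = value
        else:
            # Lists, dicts, None, etc. become a stable JSON string.
            loggable[key] = json.dumps(value, sort_keys=True, default=str)
    return loggable


hyperparameters = to_loggable(
    {"model": "gpt-4.1", "temperature": 0.2, "tools": ["search", "checkout"]}
)
```

The resulting dict can then be passed as `hyperparameters=` in the call shown above.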

In CI/CD

To run multi-turn end-to-end evaluations on every PR, simulate conversations once at module load, then assert_test() each one inside a pytest parametrized test:

test_chatbot.py

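import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import ConversationalTestCase
from deepeval.metrics import TurnRelevancyMetric
from deepeval.conversation_simulator import ConversationSimulator
from your_app import model_callback

# Load the goldens (here pulled by alias, as in the dataset step above).
dataset = EvaluationDataset()
dataset.pull(alias="My multi-turn dataset")

# Simulate once at module load so every parametrized test reuses the same conversations.
simulator = ConversationSimulator(model_callback=model_callback)
test_cases = simulator.simulate(
    conversational_goldens=dataset.goldens,
    max_user_simulations=10,
)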

@pytest.mark.parametrize("test_case", test_cases)
def test_chatbot(test_case: ConversationalTestCase):
    assert_test(test_case=test_case, metrics=[TurnRelevancyMetric()])

deepeval test run test_chatbot.py

See unit testing in CI/CD for assert_test() parameters, YAML pipeline examples, and deepeval test run flags.
