Unit Testing in CI/CD
Integrate LLM evaluations into your CI/CD pipeline with deepeval to catch regressions before they ship. deepeval plugs into pytest via assert_test() and the deepeval test run command, so every push (or every PR) runs the same evals you'd run locally — single-turn or multi-turn, end-to-end or component-level.
How It Works
Unit testing in CI/CD is the same three steps regardless of which flavor of evaluation you're running:
- Load your dataset: pull goldens from Confident AI, a CSV, or a JSON file. This step is identical for every flavor.
- Construct test cases & write your test: this is where the flavor matters. End-to-end vs component-level, single-turn vs multi-turn, and (for single-turn) instrumented vs un-instrumented all change what you put inside the pytest test.
- Run with deepeval test run: same command for every flavor. Drops into a .yml file unchanged.
deepeval's pytest integration lets you use all of pytest's flags and features alongside the capabilities offered by deepeval, which you can learn more about below.
Step-by-Step Guide
Load your dataset
deepeval loads datasets from Confident AI, a CSV, a JSON file, or directly in code into an EvaluationDataset.
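As a minimal sketch of the file-based options, the snippet below writes a small CSV and a small JSON file using only the standard library. The query column/key matches the input_col_name and input_key_name values used in this guide; the expected_output field, the row contents, and the assumption that the JSON file holds a list of objects are illustrative and should be checked against your deepeval version.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical rows: only the "query" name comes from this guide.
ROWS = [
    {"query": "What is your name?", "expected_output": "I am an assistant."},
    {"query": "Choose a number between 1 and 100", "expected_output": "42"},
]


def write_example_files(directory: Path) -> tuple[Path, Path]:
    """Write example.csv and example.json and return their paths."""
    csv_path = directory / "example.csv"
    json_path = directory / "example.json"
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["query", "expected_output"])
        writer.writeheader()
        writer.writerows(ROWS)
    json_path.write_text(json.dumps(ROWS, indent=2), encoding="utf-8")
    return csv_path, json_path


def read_queries(csv_path: Path) -> list[str]:
    """Read back the column that would become each golden's input."""
    with csv_path.open(newline="", encoding="utf-8") as f:
        return [row["query"] for row in csv.DictReader(f)]


with tempfile.TemporaryDirectory() as tmp:
    csv_file, json_file = write_example_files(Path(tmp))
    queries = read_queries(csv_file)
    json_rows = json.loads(json_file.read_text(encoding="utf-8"))
```

Reading the files back before committing them is a cheap way to confirm that the column and key names match what you pass to the loaders.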
From Confident AI:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")

From a CSV file:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_goldens_from_csv_file(
    file_path="example.csv",
    input_col_name="query",
)

From a JSON file:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_goldens_from_json_file(
    file_path="example.json",
    input_key_name="query",
)

In code:

from deepeval.dataset import Golden, EvaluationDataset

goldens = [
    Golden(input="What is your name?"),
    Golden(input="Choose a number between 1 and 100"),
    # ...
]
dataset = EvaluationDataset(goldens=goldens)

Construct test cases
Pick the flavor that matches your application — single-turn (one input → one output) or multi-turn (whole conversations).
Within single-turn, we strongly recommend instrumenting your app with tracing so deepeval can build the LLMTestCase automatically from each run, and you get a full per-test-case trace on Confident AI for free.
The same setup also unlocks component-level evaluation, where metrics live on individual spans (retrievers, tool calls, sub-agents) instead of the trace as a whole.
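To make the difference between trace-level and span-level metrics concrete, here is a toy model written with only the standard library. It is not deepeval's implementation: toy_observe, SPANS, and the metric names as plain strings are hypothetical stand-ins that only illustrate how a decorator could record one span per call and attach metrics to an individual component.

```python
import functools

SPANS = []  # hypothetical in-memory store, one dict per recorded span


def toy_observe(metrics=None):
    """Record each call of the wrapped function as a span (toy stand-in)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            output = fn(*args, **kwargs)
            SPANS.append({
                "name": fn.__name__,
                "input": args[0] if args else None,
                "output": output,
                "metrics": list(metrics or []),
            })
            return output
        return wrapper
    return decorator


@toy_observe(metrics=["ContextualRelevancy"])  # span-level (component) metric
def retriever(query):
    return ["Pi is approximately 3.14159."]


@toy_observe()  # top-level span; end-to-end metrics would judge this one
def my_ai_agent(query):
    context = retriever(query)
    return f"Based on {len(context)} document(s): pi is about 3.14."


answer = my_ai_agent("What is pi rounded to 2 decimal places?")
span_names = [span["name"] for span in SPANS]
```

In this toy, the retriever span carries its own metric while the outer span carries none, which mirrors the idea that component-level metrics attach to inner spans and end-to-end metrics judge the run as a whole.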
Instrument/Trace with Evals
Each example below is a complete deepeval test run file with instrumentation:
import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric
from deepeval.tracing import observe, update_current_trace

@observe()
def my_ai_agent(query: str) -> str:
    answer = "Pi rounded to 2 decimal places is 3.14."
    update_current_trace(input=query, output=answer)
    return answer

dataset = EvaluationDataset(goldens=[Golden(input="What is pi rounded to 2 decimal places?")])

@pytest.mark.parametrize("golden", dataset.goldens)
def test_llm_app(golden: Golden):
    my_ai_agent(golden.input)
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])

Wrap the top-level function of your LLM app with @observe and call update_current_trace(...) to set the trace-level test case fields. See tracing for the full @observe and update_current_trace surface.
import pytest
from langchain.agents import create_agent
from deepeval import assert_test
from deepeval.integrations.langchain import CallbackHandler
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

agent = create_agent(
    model="openai:gpt-4o-mini",
    tools=[],
    system_prompt="Answer math questions concisely.",
)

dataset = EvaluationDataset(goldens=[Golden(input="What is pi rounded to 2 decimal places?")])

@pytest.mark.parametrize("golden", dataset.goldens)
def test_langchain_app(golden: Golden):
    agent.invoke(
        {"messages": [{"role": "user", "content": golden.input}]},
        config={"callbacks": [CallbackHandler()]},
    )
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])

Build your agent with create_agent and pass deepeval's CallbackHandler to its invoke method. See the LangChain integration for the full surface.
import pytest
from langchain.chat_models import init_chat_model
from langgraph.graph import StateGraph, MessagesState, START, END
from deepeval import assert_test
from deepeval.integrations.langchain import CallbackHandler
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

llm = init_chat_model("openai:gpt-4o-mini")

def chatbot(state: MessagesState):
    return {"messages": [llm.invoke(state["messages"])]}

graph = (
    StateGraph(MessagesState)
    .add_node(chatbot)
    .add_edge(START, "chatbot")
    .add_edge("chatbot", END)
    .compile()
)

dataset = EvaluationDataset(goldens=[Golden(input="What is pi rounded to 2 decimal places?")])

@pytest.mark.parametrize("golden", dataset.goldens)
def test_langgraph_app(golden: Golden):
    graph.invoke(
        {"messages": [{"role": "user", "content": golden.input}]},
        config={"callbacks": [CallbackHandler()]},
    )
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])

Wire your StateGraph and pass deepeval's CallbackHandler to its invoke method. See the LangGraph integration for the full surface.
import pytest
from deepeval import assert_test
from deepeval.openai import OpenAI
from deepeval.tracing import trace
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

client = OpenAI()

dataset = EvaluationDataset(goldens=[Golden(input="What is pi rounded to 2 decimal places?")])

@pytest.mark.parametrize("golden", dataset.goldens)
def test_openai_app(golden: Golden):
    with trace():
        client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Answer in one short sentence."},
                {"role": "user", "content": golden.input},
            ],
        )
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])

Replace from openai import OpenAI with the drop-in from deepeval.openai import OpenAI. Every chat.completions.create(...), chat.completions.parse(...), and responses.create(...) call becomes an LLM span automatically. See the OpenAI integration for the full surface.
import pytest
from pydantic_ai import Agent
from deepeval import assert_test
from deepeval.integrations.pydantic_ai import DeepEvalInstrumentationSettings
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

agent = Agent(
    "openai:gpt-5",
    system_prompt="Answer in one short sentence.",
    instrument=DeepEvalInstrumentationSettings(),
)

dataset = EvaluationDataset(goldens=[Golden(input="What is pi rounded to 2 decimal places?")])

@pytest.mark.parametrize("golden", dataset.goldens)
def test_pydantic_ai_app(golden: Golden):
    agent.run_sync(golden.input)
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])

Pass DeepEvalInstrumentationSettings() to your Agent's instrument keyword. See the Pydantic AI integration for the full surface.
import pytest
from bedrock_agentcore import BedrockAgentCoreApp
from strands import Agent
from deepeval import assert_test
from deepeval.integrations.agentcore import instrument_agentcore
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

instrument_agentcore()

app = BedrockAgentCoreApp()
agent = Agent(model="amazon.nova-lite-v1:0")

dataset = EvaluationDataset(goldens=[Golden(input="What is pi rounded to 2 decimal places?")])

@app.entrypoint
def invoke(payload):
    result = agent(payload["prompt"])
    return {"result": result.message}

@pytest.mark.parametrize("golden", dataset.goldens)
def test_agentcore_app(golden: Golden):
    invoke({"prompt": golden.input})
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])

Call instrument_agentcore() before creating your AgentCore app. The same call also instruments Strands agents running inside AgentCore. See the AgentCore integration for the full surface.
import pytest
from strands import Agent
from strands.models.openai import OpenAIModel
from deepeval import assert_test
from deepeval.integrations.strands import instrument_strands
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

instrument_strands()

agent = Agent(
    model=OpenAIModel(model_id="gpt-4o-mini"),
    system_prompt="You are a helpful assistant.",
)

dataset = EvaluationDataset(goldens=[Golden(input="Help me return my order.")])

@pytest.mark.parametrize("golden", dataset.goldens)
def test_strands_agent(golden: Golden):
    agent(golden.input)
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])

Call instrument_strands() before creating or invoking your agent. Use this when you run Strands directly; for AgentCore-hosted Strands, use the AgentCore tab. See the Strands integration for the full surface.
import pytest
from deepeval import assert_test
from deepeval.anthropic import Anthropic
from deepeval.tracing import trace
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

client = Anthropic()

dataset = EvaluationDataset(goldens=[Golden(input="What is pi rounded to 2 decimal places?")])

@pytest.mark.parametrize("golden", dataset.goldens)
def test_anthropic_app(golden: Golden):
    with trace():
        client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            system="Answer in one short sentence.",
            messages=[{"role": "user", "content": golden.input}],
        )
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])

Replace from anthropic import Anthropic with the drop-in from deepeval.anthropic import Anthropic. Every messages.create(...) call becomes an LLM span automatically. See the Anthropic integration for the full surface.
import asyncio
import pytest
from llama_index.llms.openai import OpenAI
from llama_index.core.agent import FunctionAgent
import llama_index.core.instrumentation as instrument
from deepeval import assert_test
from deepeval.integrations.llama_index import instrument_llama_index
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

instrument_llama_index(instrument.get_dispatcher())

agent = FunctionAgent(
    tools=[],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="Answer math questions concisely.",
)

dataset = EvaluationDataset(goldens=[Golden(input="What is pi rounded to 2 decimal places?")])

@pytest.mark.parametrize("golden", dataset.goldens)
def test_llamaindex_app(golden: Golden):
    asyncio.run(agent.run(golden.input))
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])

Register deepeval's event handler against LlamaIndex's instrumentation dispatcher. See the LlamaIndex integration for the full surface.
import pytest
from agents import Runner, add_trace_processor
from deepeval import assert_test
from deepeval.openai_agents import Agent, DeepEvalTracingProcessor
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

add_trace_processor(DeepEvalTracingProcessor())

agent = Agent(
    name="math_agent",
    instructions="Answer math questions concisely.",
)

dataset = EvaluationDataset(goldens=[Golden(input="What is pi rounded to 2 decimal places?")])

@pytest.mark.parametrize("golden", dataset.goldens)
def test_openai_agents_app(golden: Golden):
    Runner.run_sync(agent, golden.input)
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])

Register DeepEvalTracingProcessor once, then build your agent with deepeval's Agent shim. See the OpenAI Agents integration for the full surface.
import asyncio
import pytest
from google.adk.agents import LlmAgent
from google.adk.runners import InMemoryRunner
from google.genai import types
from deepeval import assert_test
from deepeval.integrations.google_adk import instrument_google_adk
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

instrument_google_adk()

agent = LlmAgent(model="gemini-2.0-flash", name="assistant", instruction="Answer math questions concisely.")
runner = InMemoryRunner(agent=agent, app_name="deepeval-google-adk")

dataset = EvaluationDataset(goldens=[Golden(input="What is pi rounded to 2 decimal places?")])

async def run_agent(prompt: str) -> str:
    session = await runner.session_service.create_session(app_name="deepeval-google-adk", user_id="demo-user")
    message = types.Content(role="user", parts=[types.Part(text=prompt)])
    async for event in runner.run_async(user_id="demo-user", session_id=session.id, new_message=message):
        if event.is_final_response() and event.content:
            return "".join(part.text for part in event.content.parts if getattr(part, "text", None))
    return ""

@pytest.mark.parametrize("golden", dataset.goldens)
def test_google_adk_app(golden: Golden):
    asyncio.run(run_agent(golden.input))
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])

Call instrument_google_adk() once before building your LlmAgent. See the Google ADK integration for the full surface.
import pytest
from crewai import Task
from deepeval import assert_test
from deepeval.integrations.crewai import instrument_crewai, Crew, Agent
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

instrument_crewai()

tutor = Agent(
    role="Math Tutor",
    goal="Answer math questions accurately and concisely.",
    backstory="An experienced tutor who explains simple math clearly.",
)
task = Task(
    description="{question}",
    expected_output="Pi rounded to 2 decimal places is 3.14.",
    agent=tutor,
)
crew = Crew(agents=[tutor], tasks=[task])

dataset = EvaluationDataset(goldens=[Golden(input="What is pi rounded to 2 decimal places?")])

@pytest.mark.parametrize("golden", dataset.goldens)
def test_crewai_app(golden: Golden):
    crew.kickoff({"question": golden.input})
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])

Call instrument_crewai() once, then build your crew with deepeval's Crew and Agent shims. See the CrewAI integration for the full surface.
There is one mandatory and one optional parameter for assert_test() in this mode:
- golden: the Golden you pass in through your test function.
- [Optional] metrics: a list of BaseMetrics that you wish to run on your trace (i.e. end-to-end evals).
Without Tracing
Use this when you can't (or don't want to) instrument your app — e.g. a QA engineer evaluating a deployed black-box system. You build the LLMTestCase yourself inside the test and hand it to assert_test() directly. No tracing is involved, so you don't get per-test-case traces in CI.
import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def your_llm_app(query: str) -> str:
    return "Pi rounded to 2 decimal places is 3.14."

dataset = EvaluationDataset(goldens=[Golden(input="What is pi rounded to 2 decimal places?")])

@pytest.mark.parametrize("golden", dataset.goldens)
def test_llm_app(golden: Golden):
    answer = your_llm_app(golden.input)
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=answer,
    )
    assert_test(test_case=test_case, metrics=[AnswerRelevancyMetric()])

There are two mandatory parameters and one optional parameter for assert_test() in this mode:
- test_case: an LLMTestCase you constructed inside the test.
- metrics: a list of BaseMetrics.
- [Optional] run_async: defaults to True.
The fields you populate on LLMTestCase must match what your metrics need (e.g. FaithfulnessMetric requires retrieval_context). See test cases for the full parameter list.
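One way to catch a mismatch between test case fields and metrics before spending judge tokens is a small pre-flight check. The sketch below uses plain dictionaries and only the standard library; REQUIRED_FIELDS and missing_fields are hypothetical helpers, and apart from FaithfulnessMetric needing retrieval_context (stated above), the field sets are assumptions to verify against the metric documentation.

```python
# Hypothetical mapping: only "FaithfulnessMetric needs retrieval_context"
# comes from this guide; the other entries are assumptions.
REQUIRED_FIELDS = {
    "AnswerRelevancyMetric": {"input", "actual_output"},
    "FaithfulnessMetric": {"input", "actual_output", "retrieval_context"},
}


def missing_fields(test_case: dict, metric_names: list[str]) -> dict[str, set[str]]:
    """Return, per metric, the required fields that are absent or None."""
    present = {key for key, value in test_case.items() if value is not None}
    gaps = {}
    for name in metric_names:
        missing = REQUIRED_FIELDS.get(name, set()) - present
        if missing:
            gaps[name] = missing
    return gaps


case = {
    "input": "What is pi rounded to 2 decimal places?",
    "actual_output": "Pi rounded to 2 decimal places is 3.14.",
}
gaps = missing_fields(case, ["AnswerRelevancyMetric", "FaithfulnessMetric"])
```

Here the check reports that the faithfulness metric cannot run because no retrieval context was supplied, while the relevancy metric has everything it needs.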
Pick this if your app is multi-turn — chatbots, support agents, and any conversational app where the unit of evaluation is the whole conversation rather than a single exchange. You wrap your chatbot in a model_callback, simulate conversations against goldens, then assert_test() each ConversationalTestCase. Multi-turn evaluation is end-to-end by default; for the full standalone walkthrough see the multi-turn end-to-end guide.
1. Wrap your chatbot in a callback
The ConversationSimulator needs a way to ask your chatbot for its next reply, given the conversation so far:
Generic:

from typing import List
from deepeval.test_case import Turn

async def model_callback(input: str, turns: List[Turn], thread_id: str) -> Turn:
    response = await your_chatbot(input, turns, thread_id)
    return Turn(role="assistant", content=response)

OpenAI:

from typing import List
from deepeval.test_case import Turn
from openai import AsyncOpenAI

# The async client is required because the call below is awaited.
client = AsyncOpenAI()

async def model_callback(input: str, turns: List[Turn]) -> Turn:
    messages = [
        {"role": "system", "content": "You are a ticket purchasing assistant"},
        *[{"role": t.role, "content": t.content} for t in turns],
        {"role": "user", "content": input},
    ]
    response = await client.chat.completions.create(model="gpt-4.1", messages=messages)
    return Turn(role="assistant", content=response.choices[0].message.content)

LangChain:

from langchain.agents import create_agent
from langgraph.checkpoint.memory import InMemorySaver
from deepeval.test_case import Turn

agent = create_agent(
    model="openai:gpt-4o-mini",
    system_prompt="You are a ticket purchasing assistant.",
    checkpointer=InMemorySaver(),
)

async def model_callback(input: str, thread_id: str) -> Turn:
    result = agent.invoke(
        {"messages": [{"role": "user", "content": input}]},
        config={"configurable": {"thread_id": thread_id}},
    )
    return Turn(role="assistant", content=result["messages"][-1].content)

LlamaIndex:

from llama_index.core.storage.chat_store import SimpleChatStore
from llama_index.llms.openai import OpenAI
from llama_index.core.chat_engine import SimpleChatEngine
from llama_index.core.memory import ChatMemoryBuffer
from deepeval.test_case import Turn

chat_store = SimpleChatStore()
llm = OpenAI(model="gpt-4")

async def model_callback(input: str, thread_id: str) -> Turn:
    memory = ChatMemoryBuffer.from_defaults(chat_store=chat_store, chat_store_key=thread_id)
    chat_engine = SimpleChatEngine.from_defaults(llm=llm, memory=memory)
    response = chat_engine.chat(input)
    return Turn(role="assistant", content=response.response)

OpenAI Agents:

from agents import Agent, Runner, SQLiteSession
from deepeval.test_case import Turn

sessions = {}
agent = Agent(name="Test Assistant", instructions="You are a helpful assistant that answers questions concisely.")

async def model_callback(input: str, thread_id: str) -> Turn:
    if thread_id not in sessions:
        sessions[thread_id] = SQLiteSession(thread_id)
    session = sessions[thread_id]
    result = await Runner.run(agent, input, session=session)
    return Turn(role="assistant", content=result.final_output)

Pydantic AI:

from typing import List
from datetime import datetime
from pydantic_ai import Agent
from pydantic_ai.messages import ModelRequest, ModelResponse, UserPromptPart, TextPart
from deepeval.test_case import Turn

agent = Agent('openai:gpt-4', system_prompt="You are a helpful assistant that answers questions concisely.")

async def model_callback(input: str, turns: List[Turn]) -> Turn:
    message_history = []
    for turn in turns:
        if turn.role == "user":
            message_history.append(ModelRequest(parts=[UserPromptPart(content=turn.content, timestamp=datetime.now())], kind='request'))
        elif turn.role == "assistant":
            message_history.append(ModelResponse(parts=[TextPart(content=turn.content)], model_name='gpt-4', timestamp=datetime.now(), kind='response'))
    result = await agent.run(input, message_history=message_history)
    return Turn(role="assistant", content=result.output)

2. Simulate conversations & write your test
Run the simulator once at module load to produce ConversationalTestCases, then parametrize over them:
import pytest
import deepeval
from deepeval import assert_test
from deepeval.test_case import ConversationalTestCase
from deepeval.metrics import TurnRelevancyMetric
from deepeval.conversation_simulator import ConversationSimulator
from your_app import model_callback

# `dataset` is the EvaluationDataset of conversational goldens loaded in the first step.
simulator = ConversationSimulator(model_callback=model_callback)
test_cases = simulator.simulate(
    conversational_goldens=dataset.goldens,
    max_user_simulations=10,
)

@pytest.mark.parametrize("test_case", test_cases)
def test_chatbot(test_case: ConversationalTestCase):
    assert_test(test_case=test_case, metrics=[TurnRelevancyMetric()])

@deepeval.log_hyperparameters
def hyperparameters():
    return {"model": "gpt-4.1", "system_prompt": "Be concise."}

There are two mandatory parameters and one optional parameter for assert_test() in this mode:
- test_case: a ConversationalTestCase produced by the simulator.
- metrics: a list of BaseConversationalMetrics. See multi-turn metrics (TurnRelevancyMetric, KnowledgeRetentionMetric, RoleAdherenceMetric, ConversationCompletenessMetric).
- [Optional] run_async: defaults to True.
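To illustrate what simulating a conversation means in the step above, here is a toy loop written with only the standard library. It is not how ConversationSimulator is implemented: ToyTurn and toy_simulate are hypothetical stand-ins, the simulated user messages are fixed strings where a real simulator would generate them with an LLM, and whether the callback sees the current user message in turns is an assumption of this sketch.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ToyTurn:  # hypothetical stand-in for deepeval's Turn
    role: str
    content: str


async def model_callback(input: str, turns: list[ToyTurn]) -> ToyTurn:
    # Stand-in chatbot: numbers its replies from the history it was given.
    reply_number = len(turns) // 2 + 1
    return ToyTurn(role="assistant", content=f"Reply #{reply_number} to: {input}")


async def toy_simulate(scenario: str, max_user_simulations: int) -> list[ToyTurn]:
    """Alternate simulated user messages with callback replies."""
    turns: list[ToyTurn] = []
    for i in range(max_user_simulations):
        user_input = f"{scenario} (message {i + 1})"
        # The callback receives the history so far, then both turns are stored.
        reply = await model_callback(user_input, list(turns))
        turns.append(ToyTurn(role="user", content=user_input))
        turns.append(reply)
    return turns


conversation = asyncio.run(toy_simulate("Buy a concert ticket", max_user_simulations=3))
```

Each simulated user message produces one user turn and one assistant turn, so the cap on user simulations bounds the length of the resulting conversation.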
Run with deepeval test run
Whichever flavor you picked above, the command is the same:
deepeval test run test_llm_app.py

YAML File For CI/CD Evals
Drop deepeval test run into a .yml to run your unit tests on every push or PR. This example uses poetry for installation and OPENAI_API_KEY as your LLM judge to run evals locally. Add CONFIDENT_API_KEY to send results to Confident AI.
name: LLM App `deepeval` Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Install Poetry
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
          echo "$HOME/.local/bin" >> $GITHUB_PATH

      - name: Install Dependencies
        run: poetry install --no-root

      - name: Run `deepeval` Unit Tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          CONFIDENT_API_KEY: ${{ secrets.CONFIDENT_API_KEY }}
        run: poetry run deepeval test run test_llm_app.py

See the deepeval test run documentation to learn about the optional flags available to the command.
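Since the workflow depends on repository secrets, a missing key tends to surface only as a failed evaluation deep inside the run. The sketch below is one possible fail-fast check using only the standard library; check_env and main are hypothetical helpers, and treating OPENAI_API_KEY as required and CONFIDENT_API_KEY as optional simply follows the description in this guide.

```python
import os

REQUIRED_ENV = ["OPENAI_API_KEY"]     # used by the LLM judge in this guide
OPTIONAL_ENV = ["CONFIDENT_API_KEY"]  # only needed to send results to Confident AI


def check_env(environ) -> tuple[list[str], list[str]]:
    """Return (missing required, missing optional) variable names."""
    missing_required = [key for key in REQUIRED_ENV if not environ.get(key)]
    missing_optional = [key for key in OPTIONAL_ENV if not environ.get(key)]
    return missing_required, missing_optional


def main(environ=None) -> int:
    """Return a process exit code: 1 if a required variable is missing, else 0."""
    environ = os.environ if environ is None else environ
    required, optional = check_env(environ)
    for key in optional:
        print(f"note: {key} is not set; results will stay local")
    for key in required:
        print(f"error: {key} is not set")
    return 1 if required else 0


# Example with an explicit mapping instead of the real environment.
exit_code = main({"OPENAI_API_KEY": "sk-test"})
```

A script like this could run as its own workflow step before the test step, passing the exit code to sys.exit so the job stops early when a secret is missing.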