Unit Testing in CI/CD
Integrate LLM evaluations into your CI/CD pipeline with deepeval to catch regressions and ensure reliable performance. You can use deepeval in your CI/CD pipelines to run both end-to-end and component-level evaluations. deepeval lets you run evaluations as if you were using pytest via our Pytest integration.
End-to-End Evals in CI/CD
Run tests against your LLM app using golden datasets on every push. End-to-end evaluations validate overall behavior across single-turn and multi-turn interactions, making them perfect for catching regressions before you deploy to production.
Single-Turn Evals
Load your dataset
deepeval offers support for loading datasets stored in JSON files, CSV files, and Hugging Face datasets into an EvaluationDataset as either test cases or goldens.
- Confident AI
- From CSV
- From JSON
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset()
dataset.add_goldens_from_csv_file(
    # file_path is the absolute path to your .csv file
file_path="example.csv",
input_col_name="query"
)
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset()
dataset.add_goldens_from_json_file(
    # file_path is the absolute path to your .json file
file_path="example.json",
input_key_name="query"
)
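If you don't yet have a dataset on Confident AI or in a file, you can also construct goldens directly in code. The snippet below is a minimal sketch with made-up inputs:

from deepeval.dataset import EvaluationDataset, Golden

# Hypothetical goldens; in practice these come from your own data
goldens = [
    Golden(input="What does your premium plan include?"),
    Golden(input="How do I cancel my subscription?"),
]
dataset = EvaluationDataset(goldens=goldens)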
Assert your tests
You can use deepeval's assert_test function to write test files.
import pytest
import deepeval
from deepeval import assert_test
from deepeval.dataset import Golden
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

from your_agent import your_llm_app # Replace with your LLM app

# `dataset` is the EvaluationDataset you loaded in the previous step

# Loop through goldens using pytest
@pytest.mark.parametrize("golden", dataset.goldens)
def test_llm_app(golden: Golden):
    res, text_chunks = your_llm_app(golden.input)
    test_case = LLMTestCase(input=golden.input, actual_output=res, retrieval_context=text_chunks)
    assert_test(test_case=test_case, metrics=[AnswerRelevancyMetric()])

@deepeval.log_hyperparameters(model="gpt-4.1", prompt_template="...")
def hyperparameters():
    return {"model": "gpt-4.1", "system prompt": "..."}
Then, run the following command in your CLI:
deepeval test run test_llm_app.py
There are TWO mandatory and ONE optional parameter when calling the assert_test() function for END-TO-END evaluation:

- test_case: an LLMTestCase.
- metrics: a list of metrics of type BaseMetric.
- [Optional] run_async: a boolean which, when set to True, enables concurrent evaluation of all metrics in metrics. Defaulted to True.
Create a YAML file to execute your test file automatically in CI/CD pipelines. See the YAML File For CI/CD Evals section below for an example.
Multi-Turn Evals
Wrap chatbot in callback
The ConversationSimulator generates synthetic test cases from goldens by calling your chatbot, so you need to wrap it in a callback function that returns the next chatbot response in a conversation, given the conversation history.
- Python
- OpenAI
- LangChain
- LlamaIndex
- OpenAI Agents
- Pydantic AI
from typing import List
from deepeval.test_case import Turn

async def model_callback(input: str, turns: List[Turn], thread_id: str) -> Turn:
    # Replace with your chatbot
    response = await your_chatbot(input, turns, thread_id)
    return Turn(role="assistant", content=response)
from typing import List
from deepeval.test_case import Turn
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def model_callback(input: str, turns: List[Turn]) -> Turn:
    messages = [
        {"role": "system", "content": "You are a ticket purchasing assistant"},
        *[{"role": t.role, "content": t.content} for t in turns],
        {"role": "user", "content": input},
    ]
    response = await client.chat.completions.create(model="gpt-4.1", messages=messages)
    return Turn(role="assistant", content=response.choices[0].message.content)
from deepeval.test_case import Turn
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory

store = {}
llm = ChatOpenAI(model="gpt-4")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a ticket purchasing assistant."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])
chain_with_history = RunnableWithMessageHistory(
    prompt | llm,
    lambda session_id: store.setdefault(session_id, ChatMessageHistory()),
    input_messages_key="input",
    history_messages_key="history",
)

async def model_callback(input: str, thread_id: str) -> Turn:
    response = await chain_with_history.ainvoke(
        {"input": input},
        config={"configurable": {"session_id": thread_id}},
    )
    return Turn(role="assistant", content=response.content)
from deepeval.test_case import Turn
from llama_index.core.storage.chat_store import SimpleChatStore
from llama_index.llms.openai import OpenAI
from llama_index.core.chat_engine import SimpleChatEngine
from llama_index.core.memory import ChatMemoryBuffer

chat_store = SimpleChatStore()
llm = OpenAI(model="gpt-4")

async def model_callback(input: str, thread_id: str) -> Turn:
    memory = ChatMemoryBuffer.from_defaults(chat_store=chat_store, chat_store_key=thread_id)
    chat_engine = SimpleChatEngine.from_defaults(llm=llm, memory=memory)
    response = chat_engine.chat(input)
    return Turn(role="assistant", content=response.response)
from agents import Agent, Runner, SQLiteSession
from deepeval.test_case import Turn

sessions = {}
agent = Agent(name="Test Assistant", instructions="You are a helpful assistant that answers questions concisely.")

async def model_callback(input: str, thread_id: str) -> Turn:
    if thread_id not in sessions:
        sessions[thread_id] = SQLiteSession(thread_id)
    session = sessions[thread_id]
    result = await Runner.run(agent, input, session=session)
    return Turn(role="assistant", content=result.final_output)
from typing import List
from datetime import datetime

from pydantic_ai import Agent
from pydantic_ai.messages import ModelRequest, ModelResponse, UserPromptPart, TextPart

from deepeval.test_case import Turn

agent = Agent('openai:gpt-4', system_prompt="You are a helpful assistant that answers questions concisely.")

async def model_callback(input: str, turns: List[Turn]) -> Turn:
    message_history = []
    for turn in turns:
        if turn.role == "user":
            message_history.append(ModelRequest(parts=[UserPromptPart(content=turn.content, timestamp=datetime.now())], kind='request'))
        elif turn.role == "assistant":
            message_history.append(ModelResponse(parts=[TextPart(content=turn.content)], model_name='gpt-4', timestamp=datetime.now(), kind='response'))
    result = await agent.run(input, message_history=message_history)
    return Turn(role="assistant", content=result.output)
Your model callback should accept an input, and optionally turns and thread_id. It should return a Turn object.
Load your dataset
deepeval offers support for loading datasets stored in JSON files, CSV files, and Hugging Face datasets into an EvaluationDataset as either test cases or goldens.
- Confident AI
- From CSV
- From JSON
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset()
dataset.add_goldens_from_csv_file(
    # file_path is the absolute path to your .csv file
file_path="example.csv",
input_col_name="query"
)
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset()
dataset.add_goldens_from_json_file(
    # file_path is the absolute path to your .json file
file_path="example.json",
input_key_name="query"
)
Assert your tests
You can use deepeval's assert_test function to write test files.
import pytest
import deepeval
from deepeval import assert_test
from deepeval.test_case import ConversationalTestCase
from deepeval.metrics import ConversationRelevancyMetric
from deepeval.conversation_simulator import ConversationSimulator

from main import chatbot_callback # Replace with your LLM callback

# `dataset` is the EvaluationDataset you loaded in the previous step

# Simulate conversational test cases from your goldens
simulator = ConversationSimulator(model_callback=chatbot_callback)
conversational_test_cases = simulator.simulate(goldens=dataset.goldens, max_turns=10)

# Loop through simulated test cases using pytest
@pytest.mark.parametrize("test_case", conversational_test_cases)
def test_llm_app(test_case: ConversationalTestCase):
    # Use a conversational metric when evaluating ConversationalTestCases
    assert_test(test_case=test_case, metrics=[ConversationRelevancyMetric()])

@deepeval.log_hyperparameters(model="gpt-4.1", prompt_template="...")
def hyperparameters():
    return {"model": "gpt-4.1", "system prompt": "..."}
Then, run the following command in your CLI:
deepeval test run test_llm_app.py
There are TWO mandatory and ONE optional parameter when calling the assert_test() function for END-TO-END evaluation:

- test_case: a ConversationalTestCase.
- metrics: a list of metrics of type BaseConversationalMetric.
- [Optional] run_async: a boolean which, when set to True, enables concurrent evaluation of all metrics in metrics. Defaulted to True.
Create a YAML file to execute your test file automatically in CI/CD pipelines. See the YAML File For CI/CD Evals section below for an example.
The usual pytest command would still work but is strongly discouraged. deepeval test run adds a range of functionalities on top of Pytest for unit-testing LLMs, enabled by 8+ optional flags. Users typically include deepeval test run as a command in their .yaml files for pre-deployment checks in CI/CD pipelines (see the example YAML file below). Click here to learn about the different optional flags available to deepeval test run for customizing asynchronous behaviors, error handling, etc.
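For example, assuming the -n flag for running tests across multiple processes is available in your version of deepeval (consult the flag reference to confirm), a parallelized run might look like this:

deepeval test run test_llm_app.py -n 4

Spreading test cases across processes can noticeably shorten CI runs on larger datasets.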
Component-Level Evals in CI/CD
Test individual parts of your LLM pipeline like prompt templates or retrieval logic in isolation. Component-level evals offer fast, targeted feedback and integrate seamlessly into your CI/CD workflows.
Load your dataset
deepeval offers support for loading datasets stored in JSON files, CSV files, and Hugging Face datasets into an EvaluationDataset as either test cases or goldens.
- Confident AI
- From CSV
- From JSON
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset()
dataset.add_goldens_from_csv_file(
    # file_path is the absolute path to your .csv file
file_path="example.csv",
input_col_name="query"
)
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset()
dataset.add_goldens_from_json_file(
    # file_path is the absolute path to your .json file
file_path="example.json",
input_key_name="query"
)
Assert your tests
You can use deepeval's assert_test function to write test files.
import pytest
import deepeval
from deepeval import assert_test
from deepeval.dataset import Golden

from your_agent import your_llm_app # Replace with your @observe decorated LLM app

# `dataset` is the EvaluationDataset you loaded in the previous step

# Loop through goldens in our dataset using pytest
@pytest.mark.parametrize("golden", dataset.goldens)
def test_llm_app(golden: Golden):
    assert_test(golden=golden, observed_callback=your_llm_app)

@deepeval.log_hyperparameters(model="gpt-4.1", prompt_template="...")
def hyperparameters():
    return {"model": "gpt-4.1", "system prompt": "..."}
Finally, don't forget to run the test file in the CLI:
deepeval test run test_llm_app.py
There are TWO mandatory and ONE optional parameter when calling the assert_test() function for COMPONENT-LEVEL evaluation:

- golden: the Golden that you wish to invoke your observed_callback with.
- observed_callback: a function callback that is your @observe decorated LLM application. There must be AT LEAST ONE metric within one of the metrics in your @observe decorated LLM application.
- [Optional] run_async: a boolean which, when set to True, enables concurrent evaluation of all metrics in @observe. Defaulted to True.
Create a YAML file to execute your test file automatically in CI/CD pipelines. See the YAML File For CI/CD Evals section below for an example.
Similar to the evaluate() function, assert_test() for component-level evaluation does not need:

- Declaration of metrics, because those are defined at the span level in the metrics parameter.
- Creation of LLMTestCases, because that is handled at runtime by update_current_span in your LLM app (see the sketch below).
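For reference, below is a minimal sketch of what such an @observe decorated LLM app might look like. The OpenAI client, model name, and single-component structure are illustrative assumptions; the key points are that metrics are attached at the span level and the LLMTestCase is created at runtime with update_current_span.

from openai import OpenAI

from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

client = OpenAI()  # illustrative LLM client for this sketch

# Metrics are declared on the span; the test case is created inside the component at runtime
@observe(metrics=[AnswerRelevancyMetric()])
def your_llm_app(input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": input}],
    )
    output = response.choices[0].message.content
    # Attach the test case to the current span so assert_test can evaluate the span-level metrics
    update_current_span(test_case=LLMTestCase(input=input, actual_output=output))
    return output

With this in place, assert_test(golden=golden, observed_callback=your_llm_app) invokes the component with each golden's input and evaluates the metrics attached to the span.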
YAML File For CI/CD Evals
To run your unit tests on every change that ships to production, you can use the following YAML file in GitHub Actions or any similar CI/CD pipeline. This example uses poetry to install dependencies and an OPENAI_API_KEY for the LLM judge that runs evals locally. You can also optionally add a CONFIDENT_API_KEY to send results to Confident AI.
name: LLM App DeepEval Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Install Poetry
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
          echo "$HOME/.local/bin" >> $GITHUB_PATH

      - name: Install Dependencies
        run: poetry install --no-root

      - name: Run DeepEval Unit Tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          CONFIDENT_API_KEY: ${{ secrets.CONFIDENT_API_KEY }}
        run: poetry run deepeval test run test_llm_app.py
Click here to learn about the different optional flags available to deepeval test run for customizing asynchronous behaviors, error handling, etc.
We highly recommend setting up Confident AI with your deepeval evaluations to get professional test reports and observe trends in your LLM application's performance over time.