Unit Testing in CI/CD

Integrate LLM evaluations into your CI/CD pipeline with deepeval to catch regressions and ensure reliable performance. You can use deepeval to run both end-to-end and component-level evaluations as part of your pipelines.

deepeval's Pytest integration lets you run evaluations just as you would run regular Pytest tests.

End-to-End Evals in CI/CD

Run tests against your LLM app using golden datasets for every push you make. End-to-end evaluations validate overall behavior across single-turn and multi-turn interactions. Perfect for catching regressions before deploying to production.

Single-Turn Evals

Load your dataset

deepeval offers support for loading datasets stored in JSON files, CSV files, and Hugging Face datasets into an EvaluationDataset as either test cases or goldens.

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")

You can learn more about loading datasets here.
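For example, if your goldens live in a local CSV file rather than on Confident AI, you can load them directly. The snippet below is a minimal sketch assuming a goldens.csv file with an input column; the exact method and parameter names may vary slightly between deepeval versions:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_goldens_from_csv_file(
    file_path="goldens.csv",  # hypothetical local file
    input_col_name="input"
)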

Assert your tests

You can use deepeval's assert_test function to write test files.

test_llm_app.py
from your_agent import your_llm_app # Replace with your LLM app
import pytest

import deepeval
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")

# Loop through goldens using pytest
@pytest.mark.parametrize("golden", dataset.goldens)
def test_llm_app(golden: Golden):
    res, text_chunks = your_llm_app(golden.input)
    test_case = LLMTestCase(input=golden.input, actual_output=res, retrieval_context=text_chunks)
    assert_test(test_case=test_case, metrics=[AnswerRelevancyMetric()])

@deepeval.log_hyperparameters(model="gpt-4", prompt_template="...")
def hyperparameters():
    return {"model": "gpt-4", "system prompt": "..."}

Then, run the following command in your CLI:

deepeval test run test_llm_app.py

There are TWO mandatory and ONE optional parameter when calling the assert_test() function for END-TO-END evaluation:

  • test_case: an LLMTestCase.
  • metrics: a list of metrics of type BaseMetric.
  • [Optional] run_async: a boolean which, when set to True, enables concurrent evaluation of all metrics. Defaulted to True (see the example below).
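For instance, to force metrics to evaluate sequentially (useful when debugging), you can opt out of concurrency per assertion. This minimal sketch reuses the test case and metric from the test file above:

assert_test(
    test_case=test_case,
    metrics=[AnswerRelevancyMetric()],
    run_async=False  # evaluate metrics one at a time instead of concurrently
)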

Create a YAML file to execute your test file automatically in CI/CD pipelines. Click here for an example YAML file.

Multi-Turn Evals

Wrap chatbot in callback

To generate synthetic test cases from goldens using the ConversationSimulator, you first need to wrap your chatbot in a callback. Define a callback function that returns your chatbot's next response in a conversation, given the conversation history.

main.py
from typing import List

from deepeval.test_case import Turn

async def model_callback(input: str, turns: List[Turn], thread_id: str) -> Turn:
    # Replace with your chatbot
    response = await your_chatbot(input, turns, thread_id)
    return Turn(role="assistant", content=response)

info

Your model callback should accept an input, and optionally turns and thread_id. It should return a Turn object.
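For instance, a callback that accepts only the input and the conversation turns (and no thread_id) is equally valid; this is a minimal sketch, with your_chatbot standing in for your own code:

async def model_callback(input: str, turns: List[Turn]) -> Turn:
    response = await your_chatbot(input, turns)
    return Turn(role="assistant", content=response)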

Load your dataset

deepeval offers support for loading datasets stored in JSON files, CSV files, and Hugging Face datasets into an EvaluationDataset as either test cases or goldens.

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")

You can learn more about loading datasets here.

Assert your tests

You can use deepeval's assert_test function to write test files.

test_llm_app.py
from main import model_callback # Replace with your chatbot callback
import pytest

import deepeval
from deepeval import assert_test
from deepeval.conversation_simulator import ConversationSimulator
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import ConversationRelevancyMetric
from deepeval.test_case import ConversationalTestCase

dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")

# Simulate conversations from goldens, then loop through them using pytest
simulator = ConversationSimulator(model_callback=model_callback)
conversational_test_cases = simulator.simulate(goldens=dataset.goldens, max_turns=10)

@pytest.mark.parametrize("test_case", conversational_test_cases)
def test_llm_app(test_case: ConversationalTestCase):
    assert_test(test_case=test_case, metrics=[ConversationRelevancyMetric()])

@deepeval.log_hyperparameters(model="gpt-4", prompt_template="...")
def hyperparameters():
    return {"model": "gpt-4", "system prompt": "..."}

Then, run the following command in your CLI:

deepeval test run test_llm_app.py

There are TWO mandatory and ONE optional parameter when calling the assert_test() function for MULTI-TURN end-to-end evaluation:

  • test_case: a ConversationalTestCase.
  • metrics: a list of conversational metrics.
  • [Optional] run_async: a boolean which, when set to True, enables concurrent evaluation of all metrics. Defaulted to True.

Create a YAML file to execute your test file automatically in CI/CD pipelines. Click here for an example YAML file.

caution

The usual pytest command would still work, but it is strongly discouraged. deepeval test run adds a range of functionality on top of Pytest for unit-testing LLMs, enabled through 8+ optional flags. Users typically include deepeval test run as a command in their .yaml files for pre-deployment checks in CI/CD pipelines (example here).

Click here to learn about different optional flags available to deepeval test run to customize asynchronous behaviors, error handling, etc.
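For example, a typical CI invocation might parallelize test cases across processes and reuse cached results. The flags below are illustrative and may differ between deepeval versions, so verify them with deepeval test run --help:

# Run test cases in 4 parallel processes and reuse cached results where available
deepeval test run test_llm_app.py -n 4 -c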

Component-Level Evals in CI/CD

Test individual parts of your LLM pipeline like prompt templates or retrieval logic in isolation. Component-level evals offer fast, targeted feedback and integrate seamlessly into your CI/CD workflows.
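Component-level evaluation expects your LLM app to be decorated with @observe, with metrics attached at the span level. The sketch below shows the general shape, assuming a hypothetical generate() component; observe and update_current_span are deepeval's tracing API, while your_llm_app and generate stand in for your own code:

from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

@observe(metrics=[AnswerRelevancyMetric()])
def generate(query: str) -> str:
    # Replace with your actual LLM call
    answer = "..."
    # Create the LLMTestCase for this span at runtime
    update_current_span(test_case=LLMTestCase(input=query, actual_output=answer))
    return answer

@observe()
def your_llm_app(query: str) -> str:
    # Top-level component that assert_test() invokes with each golden's input
    return generate(query)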

Load your dataset

deepeval offers support for loading datasets stored in JSON files, CSV files, and Hugging Face datasets into an EvaluationDataset as either test cases or goldens.

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")

You can learn more about loading datasets here.

Assert your tests

You can use deepeval's assert_test function to write test files.

test_llm_app.py
import pytest
from your_agent import your_llm_app # Replace with your @observe decorated LLM app

import deepeval
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")

# Loop through goldens in our dataset using pytest
@pytest.mark.parametrize("golden", dataset.goldens)
def test_llm_app(golden: Golden):
    assert_test(golden=golden, observed_callback=your_llm_app)

@deepeval.log_hyperparameters(model="gpt-4", prompt_template="...")
def hyperparameters():
    return {"model": "gpt-4", "system prompt": "..."}

Finally, don't forget to run the test file in the CLI:

deepeval test run test_llm_app.py

There are TWO mandatory and ONE optional parameter when calling the assert_test() function for COMPONENT-LEVEL evaluation:

  • golden: the Golden that you wish to invoke your observed_callback with.
  • observed_callback: a function callback that is your @observe decorated LLM application. There must be AT LEAST ONE metric attached to at least one of the @observe decorated components in your LLM application.
  • [Optional] run_async: a boolean which when set to True, enables concurrent evaluation of all metrics in @observe. Defaulted to True.

Create a YAML file to execute your test file automatically in CI/CD pipelines. Click here for an example YAML file.

info

Similar to the evaluate() function, assert_test() for component-level evaluation does not need:

  • Declaration of metrics, because those are defined at the span level via the metrics parameter.
  • Creation of LLMTestCases, because that is handled at runtime by update_current_span in your LLM app.

YAML File For CI/CD Evals

To run your unit tests on every change to production, you can use the following YAML file in GitHub Actions or any similar CI/CD pipeline. This example uses Poetry to install dependencies and OPENAI_API_KEY for the LLM judge that runs evals locally. You can also optionally add CONFIDENT_API_KEY to send results to Confident AI.

name: LLM App DeepEval Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Install Poetry
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
          echo "$HOME/.local/bin" >> $GITHUB_PATH

      - name: Install Dependencies
        run: poetry install --no-root

      - name: Run DeepEval Unit Tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          CONFIDENT_API_KEY: ${{ secrets.CONFIDENT_API_KEY }}
        run: poetry run deepeval test run test_llm_app.py

Click here to learn about different optional flags available to deepeval test run to customize asynchronous behaviors, error handling, etc.

tip

We highly recommend setting up Confident AI with your deepeval evaluations to get professional test reports and observe trends in your LLM application's performance over time.
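Assuming you already have a Confident AI account and API key, connecting deepeval on your local machine is a single CLI command (in CI, setting the CONFIDENT_API_KEY secret as in the YAML above is sufficient):

deepeval login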
