RAG Evaluation Quickstart

Learn to evaluate retrieval-augmented generation (RAG) pipelines and systems using deepeval, such as RAG QA, summarizers, and customer support chatbots.

Overview

RAG evaluation involves evaluating the retriever and generator as separate components. This is because in a RAG pipeline, the final output is only as good as the context you've fed into your LLM.

In this 5-minute quickstart, you'll learn how to:

  • Evaluate your RAG pipeline end-to-end
  • Test the retriever and generator as separate components
  • Evaluate multi-turn RAG

Installation

pip install -U deepeval[inspect]

Do I need the [inspect] sub-module?

The [inspect] sub-module is optional. It increases deepeval's install size, so we recommend installing deepeval[inspect] only in development environments.

You should also run deepeval login:

deepeval login

It connects you to Confident AI so you can store results, annotate, and inspect evaluated agent traces on the cloud.

Unit Test RAG in CI/CD

The fastest way to start evaluating RAG is to unit test it. deepeval plugs into pytest via assert_test() and the deepeval test run command, so failing metrics fail the build — locally in development, and on every push or PR.

Create a dataset

Datasets in deepeval store Goldens — the inputs you'll invoke your RAG pipeline with at test time:

from deepeval.dataset import Golden, EvaluationDataset

goldens = [
    Golden(input="How do I reset my password?"),
    Golden(input="What's your refund policy?"),
    # ...
]

dataset = EvaluationDataset(goldens=goldens)

Or pull an existing dataset from Confident AI:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")

Or load goldens from a CSV file:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_goldens_from_csv_file(
    file_path="example.csv",
    input_col_name="query",
)

Or load goldens from a JSON file:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_goldens_from_json_file(
    file_path="example.json",
    input_key_name="query",
)
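
For the CSV route, the file needs a column whose header matches input_col_name. The following is a minimal standard-library sketch of writing such a file. The example queries and the temporary directory are illustrative assumptions, not part of deepeval:

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical queries; replace with inputs your RAG pipeline should handle.
queries = ["How do I reset my password?", "What's your refund policy?"]

# Write to a temporary directory so the sketch leaves no files behind in your project.
csv_path = Path(tempfile.mkdtemp()) / "example.csv"
with csv_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["query"])
    writer.writeheader()  # header row "query", matching input_col_name="query"
    writer.writerows({"query": q} for q in queries)

# Read the file back to confirm there is one row per golden input.
with csv_path.open(newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
```

Each row then becomes one golden whose input is the value in the query column.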

Write your test file

Pick based on whether you can modify your RAG pipeline's code:

  • Without tracing — you're testing a deployed or black-box system (e.g. as a QA engineer). Build the LLMTestCase yourself from your pipeline's outputs.
  • With tracing — you own the code. Instrument it once and deepeval builds test cases from traces automatically, with a full trace per test case.

Without tracing:

test_rag.py
import pytest
from deepeval import assert_test
from deepeval.dataset import Golden
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from your_app import rag_pipeline  # returns (answer, retrieved_chunks)

# `dataset` is the EvaluationDataset created in the previous step
@pytest.mark.parametrize("golden", dataset.goldens)
def test_rag(golden: Golden):
    answer, retrieved_chunks = rag_pipeline(golden.input)
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=answer,
        retrieval_context=retrieved_chunks,
    )
    assert_test(test_case=test_case, metrics=[AnswerRelevancyMetric(), FaithfulnessMetric()])

retrieval_context is crucial — RAG metrics like FaithfulnessMetric evaluate against the chunks retrieved at evaluation time.
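
The test above assumes your_app.rag_pipeline returns both the answer and the chunks it retrieved. As a rough sketch of that return shape, here is a toy pipeline built on keyword overlap. DOCS, retrieve, and generate are hypothetical stand-ins for your own vector store and LLM call:

```python
import re

# Hypothetical knowledge base; a real pipeline would query a vector store.
DOCS = [
    "To reset your password, open Settings and click 'Reset password'.",
    "Refunds are available within 30 days of purchase.",
    "Our support team is available on weekdays.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Toy retriever: rank documents by the number of words shared with the query."""
    words = tokenize(query)
    scored = [(len(words & tokenize(doc)), doc) for doc in DOCS]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:top_k] if score > 0]

def generate(query: str, chunks: list[str]) -> str:
    """Stand-in for an LLM call: answer with the top-ranked chunk."""
    return chunks[0] if chunks else "I don't know."

def rag_pipeline(query: str) -> tuple[str, list[str]]:
    """Return the answer together with the chunks used to produce it."""
    chunks = retrieve(query)
    return generate(query, chunks), chunks

answer, retrieved_chunks = rag_pipeline("How do I reset my password?")
```

Returning the chunks alongside the answer is the design choice that matters here: it lets the test pass the exact context the generator saw as retrieval_context.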

With tracing:

test_rag.py
import pytest
from deepeval import assert_test
from deepeval.dataset import Golden
from deepeval.tracing import observe, update_current_trace
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

@observe()
def rag_pipeline(query: str) -> str:
    chunks = retrieve(query)  # your retrieval logic, @observe optional
    answer = generate(query, chunks)  # your LLM call, @observe optional
    update_current_trace(input=query, output=answer, retrieval_context=chunks)
    return answer

# `dataset` is the EvaluationDataset created in the previous step
@pytest.mark.parametrize("golden", dataset.goldens)
def test_rag(golden: Golden):
    rag_pipeline(golden.input)
    assert_test(golden=golden, metrics=[AnswerRelevancyMetric(), FaithfulnessMetric()])

assert_test() only needs the golden — the test case is built from the captured trace. This same setup unlocks retriever & generator evals below.

Run with deepeval test run

deepeval test run test_rag.py
  • Drop this command into a .yml to run on every push or PR — see unit testing in CI/CD for a full pipeline example.
  • Test runs are saved locally — deepeval inspect opens them in a trace-tree TUI with per-span scores and metric reasons.

Evaluate RAG with evaluate()

Prefer a script or notebook over pytest? evaluate() is the same engine as assert_test(), exposed as a function call — and since you build the test cases yourself, it requires zero instrumentation of your RAG pipeline. Ideal for evaluating deployed, black-box systems.

main.py
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from your_app import rag_pipeline  # returns (answer, retrieved_chunks)
...

test_cases = []
for golden in dataset.goldens:
    answer, retrieved_chunks = rag_pipeline(golden.input)
    test_cases.append(
        LLMTestCase(
            input=golden.input,
            actual_output=answer,
            retrieval_context=retrieved_chunks,
        )
    )

evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric(), FaithfulnessMetric()])

✅ Done. Each test case is scored against every metric, and the results roll up into a test run — a snapshot of your RAG pipeline's quality at this point in time.
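
As a rough mental model of that roll-up (a sketch, not deepeval's internal implementation): each metric produces a score that is compared with a threshold, and a test case passes only if every metric passes. The MetricResult class, the scores, and the 0.5 threshold below are assumptions made for illustration:

```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    name: str
    score: float
    threshold: float = 0.5  # assumed threshold, for illustration only

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold

def case_passed(results: list[MetricResult]) -> bool:
    """A test case passes only when all of its metrics pass."""
    return all(r.passed for r in results)

def pass_rate(run: list[list[MetricResult]]) -> float:
    """Fraction of test cases in the run that passed."""
    return sum(case_passed(results) for results in run) / len(run)

# Two hypothetical test cases, each scored against the same two metrics.
run = [
    [MetricResult("Answer Relevancy", 0.9), MetricResult("Faithfulness", 0.8)],
    [MetricResult("Answer Relevancy", 0.7), MetricResult("Faithfulness", 0.3)],
]
rate = pass_rate(run)
```

Here the second test case fails because one of its two metrics falls below the threshold, even though the other passes.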

Which RAG metrics should I use?

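deepeval offers 5 RAG metrics. Generator-focused:

  • AnswerRelevancyMetric
  • FaithfulnessMetric

And retriever-focused:

  • ContextualPrecisionMetric
  • ContextualRecallMetric
  • ContextualRelevancyMetric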

Contextual precision and recall also require an expected_output on your test cases. See the RAG evaluation guide for how to pick.

Evaluate Retriever & Generator

A single end-to-end score can't tell you whether a bad answer came from bad retrieval or bad generation. If you own the code, trace your pipeline's components and attach metrics directly to them:

main.py
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span
from deepeval.metrics import ContextualRelevancyMetric, AnswerRelevancyMetric

@observe(metrics=[ContextualRelevancyMetric()])
def retriever(query: str) -> list[str]:
    chunks = ["..."]  # your retrieval logic here
    update_current_span(test_case=LLMTestCase(input=query, retrieval_context=chunks))
    return chunks

@observe(metrics=[AnswerRelevancyMetric()])
def generator(query: str, chunks: list[str]) -> str:
    answer = "..."  # your LLM call here
    update_current_span(test_case=LLMTestCase(input=query, actual_output=answer))
    return answer

@observe()
def rag_pipeline(query: str) -> str:
    chunks = retriever(query)
    return generator(query, chunks)

dataset = EvaluationDataset(goldens=[Golden(input="How do I reset my password?")])

for golden in dataset.evals_iterator():
    rag_pipeline(golden.input)

✅ Done. A plain for loop is all it takes:

  • Retriever span — scored by retriever metrics like ContextualRelevancyMetric against its retrieval_context.
  • Generator span — scored by generator metrics like AnswerRelevancyMetric against its actual_output.
  • Want end-to-end too? Pass metrics=[...] to evals_iterator() — trace and span scores coexist in one test run.
  • In CI/CD? Swap the loop for assert_test(golden=golden) inside a pytest test — metrics stay on the spans.

See component-level evaluation for the full surface, including framework integrations.
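
To illustrate why span-level scores help, here is a small sketch of how you might read them once a run finishes. The diagnose function, its labels, and the 0.5 threshold are hypothetical and not part of deepeval:

```python
def diagnose(retriever_score: float, generator_score: float, threshold: float = 0.5) -> str:
    """Point at the component whose span score fell below the threshold."""
    retriever_ok = retriever_score >= threshold
    generator_ok = generator_score >= threshold
    if retriever_ok and generator_ok:
        return "ok"
    if not retriever_ok and not generator_ok:
        return "both"
    return "generation" if retriever_ok else "retrieval"

# Hypothetical span scores: irrelevant chunks, but an on-topic answer.
verdict = diagnose(retriever_score=0.2, generator_score=0.9)
```

With only an end-to-end score, these two failure modes would be indistinguishable.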

Evaluate Multi-Turn RAG

For chatbots that rely on RAG — like customer support bots — the unit of evaluation is the whole conversation, and each assistant turn carries its own retrieval_context.

Create a dataset

Multi-turn datasets store ConversationalGoldens — scenarios to simulate, not pre-written turns:

from deepeval.dataset import EvaluationDataset, ConversationalGolden

dataset = EvaluationDataset(goldens=[
    ConversationalGolden(
        scenario="User can't log in and wants to reset their password.",
        expected_outcome="Receives clear reset steps grounded in the help docs.",
    ),
])

Or pull an existing dataset from Confident AI:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My Multi-Turn Dataset")

Or load goldens from a CSV file:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_goldens_from_csv_file(
    file_path="example.csv",
    scenario_col_name="scenario",
    expected_outcome_col_name="expected_outcome",
)

Or load goldens from a JSON file:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_goldens_from_json_file(
    file_path="example.json",
    scenario_key_name="scenario",
    expected_outcome_key_name="expected_outcome",
)
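
For the JSON route, each record needs keys matching scenario_key_name and expected_outcome_key_name. The following sketch writes such a file with the standard library. It assumes the loader expects a top-level list of objects, and the second scenario and the temporary directory are made up for illustration:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical scenarios; one object per conversational golden.
records = [
    {
        "scenario": "User can't log in and wants to reset their password.",
        "expected_outcome": "Receives clear reset steps grounded in the help docs.",
    },
    {
        "scenario": "User asks whether a purchase made last week can be refunded.",
        "expected_outcome": "Learns the refund window and how to request one.",
    },
]

# Write to a temporary directory so the sketch leaves no files behind in your project.
json_path = Path(tempfile.mkdtemp()) / "example.json"
json_path.write_text(json.dumps(records, indent=2), encoding="utf-8")

# Read the file back to confirm every record carries both keys.
loaded = json.loads(json_path.read_text(encoding="utf-8"))
```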

Simulate conversations

Wrap your RAG chatbot in a model_callback that returns each reply as a Turn — including its retrieval_context — then let ConversationSimulator generate the conversations:

main.py
from typing import List
from deepeval.test_case import Turn
from deepeval.simulator import ConversationSimulator

async def model_callback(input: str, turns: List[Turn], thread_id: str) -> Turn:
    # Replace with your RAG chatbot; returns (answer, retrieved_chunks)
    answer, retrieved_chunks = await your_rag_chatbot(input, turns, thread_id)
    return Turn(role="assistant", content=answer, retrieval_context=retrieved_chunks)

simulator = ConversationSimulator(model_callback=model_callback)
test_cases = simulator.simulate(
    conversational_goldens=dataset.goldens,
    max_user_simulations=10,
)

The retrieval_context on each assistant turn is what lets RAG metrics check grounding turn-by-turn.
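
The callback above assumes your_rag_chatbot is an async function returning the reply together with the chunks it retrieved. A toy sketch of that shape follows. HELP_DOCS and the keyword lookup are hypothetical stand-ins for a real retriever and LLM:

```python
import asyncio

# Hypothetical help docs, keyed by the keyword that triggers them.
HELP_DOCS = {
    "password": "To reset your password, open Settings and choose 'Reset password'.",
    "refund": "Refunds are available within 30 days of purchase.",
}

async def your_rag_chatbot(user_input: str, turns: list, thread_id: str) -> tuple[str, list[str]]:
    """Toy chatbot: retrieve help docs whose keyword appears in the user message."""
    await asyncio.sleep(0)  # stands in for awaiting a real retriever or LLM
    text = user_input.lower()
    retrieved_chunks = [doc for keyword, doc in HELP_DOCS.items() if keyword in text]
    if retrieved_chunks:
        answer = retrieved_chunks[0]
    else:
        answer = "Could you tell me more about the problem?"
    return answer, retrieved_chunks

answer, retrieved_chunks = asyncio.run(
    your_rag_chatbot("I forgot my password", turns=[], thread_id="thread-1")
)
```

A real chatbot would also use turns and thread_id to carry conversation state. The toy ignores both to keep the return shape in focus.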

Run an evaluation

Score the simulated conversations with multi-turn RAG metrics:

main.py
from deepeval import evaluate
from deepeval.metrics import TurnFaithfulnessMetric, TurnContextualRelevancyMetric
...

evaluate(test_cases=test_cases, metrics=[TurnFaithfulnessMetric(), TurnContextualRelevancyMetric()])

✅ Done. Each assistant turn is evaluated against the chunks it retrieved:

  • TurnFaithfulnessMetric — does each reply stay grounded in its retrieval_context?
  • TurnContextualRelevancyMetric — were the retrieved chunks relevant to that point in the conversation?
  • Add general conversational metrics like TurnRelevancyMetric too — see the chatbot quickstart for the full multi-turn surface.

Next Steps

Now that you have run your first RAG evals, you should:

  1. Customize your metrics: Choose from the 5 RAG metrics based on your use case.
  2. Prepare a dataset: If you don't have one, generate one as a starting point.
  3. Enable evals in production: Just replace metrics in @observe with a metric_collection string on Confident AI.

This lets you analyze performance over time on production threads and add those threads back to your evals dataset for further evaluation.
