
End-to-End LLM Evaluation

End-to-end evaluation assesses the "observable" inputs and outputs of your LLM application - that is, what users actually see - and treats your LLM application as a black box.


If you're logged into Confident AI, you'll also receive a fully sharable LLM testing report on the cloud. Run this in the CLI:

deepeval login
When should you run End-to-End evaluations?

For simple LLM applications like basic RAG pipelines with "flat" architectures that can be represented by a single LLMTestCase, end-to-end evaluation is ideal. Common use cases that are suitable for end-to-end evaluation include (non-exhaustive):

  • RAG QA
  • PDF extraction
  • Writing assistants
  • Summarization
  • etc.

You'll notice that use cases with simpler architectures are better suited for end-to-end evaluation. However, if your system is an extremely complex agentic workflow, you might also find end-to-end evaluation more suitable, as you might conclude that component-level evaluation gives you too much noise in its evaluation results.

Most of what you saw in DeepEval's quickstart is end-to-end evaluation!

Prerequisites

Select metrics

You should first read the metrics section to understand which metrics are suitable for your use case, but the general rule of thumb is to include no more than 5 metrics: 2-3 generic, system-specific metrics and 1-2 custom, use-case-specific metrics. If you're unsure, feel free to ask the team for recommendations in Discord.
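
For example, a metric selection for a RAG QA use case might combine deepeval's built-in RAG metrics with one custom GEval metric. This is only a sketch - the GEval criteria below is an illustrative assumption, not a recommendation:

from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, GEval
from deepeval.test_case import LLMTestCaseParams

# 2 generic, system-specific metrics for the RAG pipeline
generic_metrics = [AnswerRelevancyMetric(threshold=0.7), FaithfulnessMetric(threshold=0.7)]

# 1 custom, use-case-specific metric (criteria is purely illustrative)
tone_metric = GEval(
    name="Professional Tone",
    criteria="Determine whether the actual output maintains a professional, helpful tone.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

metrics = generic_metrics + [tone_metric]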

Setup LLM application

You'll need to set up your LLM application to return the test case parameters required by the metrics you've chosen. Alternatively, set up LLM tracing to avoid making changes to your LLM app.

Guidelines to set up your LLM application

You'll need to make sure your application returns all fields required by your selected metrics, in order to create a valid end-to-end LLMTestCase. For example, if you’re using AnswerRelevancyMetric and FaithfulnessMetric, your application must return:

  • input
  • actual_output
  • retrieval_context

because both metrics require input and actual_output, and FaithfulnessMetric also requires retrieval_context.
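
Concretely, a valid LLMTestCase for this metric combination might look like the sketch below (all values are placeholders):

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What does your company do?",  # required by both metrics
    actual_output="We provide an open-source LLM evaluation framework.",  # required by both metrics
    retrieval_context=["Placeholder text chunk retrieved from your vector database."],  # required by FaithfulnessMetric
)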

If you cannot make changes to your LLM app, you should set up tracing, which also allows you to run and debug end-to-end evaluations on Confident AI.
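
As a rough sketch (see the tracing docs for the exact setup, which may differ by version), tracing is enabled by decorating your application's functions with deepeval's @observe decorator rather than changing their return values:

from deepeval.tracing import observe

@observe()
def your_llm_app(input: str):
    # your existing logic stays unchanged
    ...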

We'll be using the following LLM application, which has a simple, "flat" RAG architecture, to demonstrate how to run end-to-end evaluations using deepeval:

somewhere.py
from typing import List
from openai import OpenAI

client = OpenAI()

def your_llm_app(input: str):
    def retriever(input: str):
        return ["Hardcoded text chunks from your vector database"]

    def generator(input: str, retrieved_chunks: List[str]):
        res = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Use the provided context to answer the question."},
                {"role": "user", "content": "\n\n".join(retrieved_chunks) + "\n\nQuestion: " + input}
            ]
        ).choices[0].message.content
        return res

    retrieval_context = retriever(input)
    return generator(input, retrieval_context), retrieval_context


print(your_llm_app("How are you?"))

Run End-to-End Evals

Running an end-to-end LLM evaluation creates a test run — a collection of test cases that benchmarks your LLM application at a specific point in time. You would typically:

  • Loop through a list of Goldens
  • Invoke your LLM app with each golden’s input
  • Generate a set of test cases ready for evaluation

You can run end-to-end LLM evaluations in either:

  • Python scripts using the evaluate() function, or
  • CI/CD pipelines using deepeval test run

Both give you exactly the same functionality, and both integrate 100% with Confident AI for sharable testing reports on the cloud.

Use evaluate() in Python scripts

deepeval offers an evaluate() function that allows you to evaluate end-to-end LLM interactions through a list of test cases and metrics. Each test case will be evaluated by every metric you define in metrics, and a test case passes only if all metrics pass.

main.py
from somewhere import your_llm_app # Replace with your LLM app

from deepeval.dataset import EvaluationDataset, Golden
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate

dataset = EvaluationDataset(goldens=[Golden(input="...")])

# Create test cases from goldens
for golden in dataset.goldens:
    res, text_chunks = your_llm_app(golden.input)
    test_case = LLMTestCase(input=golden.input, actual_output=res, retrieval_context=text_chunks)
    dataset.add_test_case(test_case)

# Evaluate end-to-end
evaluate(test_cases=dataset.test_cases, metrics=[AnswerRelevancyMetric()])

There are TWO mandatory and SIX optional parameters when calling the evaluate() function for END-TO-END evaluation:

  • test_cases: a list of LLMTestCases OR ConversationalTestCases, or an EvaluationDataset. You cannot evaluate LLMTestCase/MLLMTestCases and ConversationalTestCases in the same test run.
  • metrics: a list of metrics of type BaseMetric.
  • [Optional] hyperparameters: a dict of type dict[str, Union[str, int, float]]. You can log any arbitrary hyperparameter associated with this test run to pick the best hyperparameters for your LLM application on Confident AI.
  • [Optional] identifier: a string that allows you to better identify your test run on Confident AI.
  • [Optional] async_config: an instance of type AsyncConfig that allows you to customize the degree of concurrency during evaluation. Defaulted to the default AsyncConfig values.
  • [Optional] display_config: an instance of type DisplayConfig that allows you to customize what is displayed to the console during evaluation. Defaulted to the default DisplayConfig values.
  • [Optional] error_config: an instance of type ErrorConfig that allows you to customize how to handle errors during evaluation. Defaulted to the default ErrorConfig values.
  • [Optional] cache_config: an instance of type CacheConfig that allows you to customize the caching behavior during evaluation. Defaulted to the default CacheConfig values.

This is exactly the same as assert_test() in deepeval test run, but with a different interface.
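
For example, a call that also logs hyperparameters and an identifier (both optional; the hyperparameter names below are purely illustrative) might look like this, reusing the dataset from the snippet above:

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric

evaluate(
    test_cases=dataset.test_cases,
    metrics=[AnswerRelevancyMetric()],
    hyperparameters={"model": "gpt-4o", "prompt template": "v1"},  # illustrative names
    identifier="end-to-end-run-1",
)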

Use deepeval test run in CI/CD pipelines

test_llm_app.py
from somewhere import your_llm_app # Replace with your LLM app
import pytest

from deepeval.dataset import Golden
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import assert_test

goldens = [Golden(input="...")]

# Loop through goldens using pytest
@pytest.mark.parametrize("golden", goldens)
def test_llm_app(golden: Golden):
    res, text_chunks = your_llm_app(golden.input)
    test_case = LLMTestCase(input=golden.input, actual_output=res, retrieval_context=text_chunks)
    assert_test(test_case=test_case, metrics=[AnswerRelevancyMetric()])

Then, run the following command in your CLI:

deepeval test run test_llm_app.py

There are TWO mandatory and ONE optional parameter when calling the assert_test() function for END-TO-END evaluation:

  • test_case: an LLMTestCase.
  • metrics: a list of metrics of type BaseMetric.
  • [Optional] run_async: a boolean which, when set to True, enables concurrent evaluation of all metrics. Defaulted to True.

Click here to learn about different optional flags available to deepeval test run to customize asynchronous behaviors, error handling, etc.

caution

The usual pytest command would still work but is highly discouraged. deepeval test run adds a range of functionalities on top of Pytest for unit-testing LLMs, enabled by 8+ optional flags. Users typically include deepeval test run as a command in their .yaml files for pre-deployment checks in CI/CD pipelines (example here).
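
As a rough illustration only, a minimal GitHub Actions workflow running deepeval test run as a pre-deployment check could look like the sketch below (the file name, dependency list, and secret name are assumptions - adapt them to your project):

# .github/workflows/llm-tests.yml
name: LLM regression tests
on: pull_request

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install deepeval openai  # plus your app's other dependencies
      - run: deepeval test run test_llm_app.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}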