End-to-End LLM Evaluation
End-to-end evaluation assesses the "observable" inputs and outputs of your LLM application: it evaluates what users actually see, and treats your LLM application as a black box.
If you're logged into Confident AI, you'll also receive a fully sharable LLM testing report on the cloud. Run this in the CLI:
deepeval login
When should you run End-to-End evaluations?
For simple LLM applications like basic RAG pipelines with "flat" architectures that can be represented by a single `LLMTestCase`, end-to-end evaluation is ideal. Common use cases that are suitable for end-to-end evaluation include (non-exhaustive):
- RAG QA
- PDF extraction
- Writing assistants
- Summarization
- etc.
You'll notice that use cases with simpler architectures are more suited for end-to-end evaluation. However, if your system is an extremely complex agentic workflow, you might also find end-to-end evaluation more suitable, as you might conclude that component-level evaluation gives you too much noise in its evaluation results.
Most of what you saw in DeepEval's quickstart is end-to-end evaluation!
Prerequisites
Select metrics
You should first read the metrics section to understand which metrics are suitable for your use case, but the general rule of thumb is to include no more than 5 metrics: 2-3 system-specific, generic metrics and 1-2 use-case-specific, custom metrics. If you're unsure, feel free to ask the team for recommendations on Discord.
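For example, for a RAG QA use case you might pair two generic RAG metrics with one custom `GEval` metric. A minimal sketch (the `GEval` name and criteria below are illustrative placeholders, not a recommendation):

from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, GEval
from deepeval.test_case import LLMTestCaseParams

# 2 system-specific, generic metrics for a RAG pipeline
answer_relevancy = AnswerRelevancyMetric()
faithfulness = FaithfulnessMetric()

# 1 use-case-specific, custom metric (name and criteria are hypothetical examples)
professional_tone = GEval(
    name="Professional Tone",
    criteria="Determine whether the actual output maintains a professional tone.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

metrics = [answer_relevancy, faithfulness, professional_tone]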
Setup LLM application
You'll need to set up your LLM application to return the test case parameters required by the metrics you've chosen. Alternatively, set up LLM tracing to avoid making changes to your LLM app.
Guidelines to set up your LLM application
You'll need to make sure your application returns all fields required by your selected metrics in order to create a valid end-to-end `LLMTestCase`. For example, if you're using `AnswerRelevancyMetric` and `FaithfulnessMetric`, your application must return:

- `input`
- `actual_output`
- `retrieval_context`

This is because both metrics require `input` and `actual_output`, and `FaithfulnessMetric` also requires `retrieval_context`.
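For instance, a valid end-to-end `LLMTestCase` for these two metrics would look something like the sketch below (the values are placeholders; in practice they come from your LLM application at runtime):

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

# Placeholder values for illustration only
test_case = LLMTestCase(
    input="How tall is the Eiffel Tower?",
    actual_output="The Eiffel Tower is about 330 meters tall.",
    retrieval_context=["The Eiffel Tower is 330 metres (1,083 ft) tall."],
)

# Both metrics can be evaluated against this single test case
AnswerRelevancyMetric().measure(test_case)
FaithfulnessMetric().measure(test_case)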
If you cannot make changes to your LLM app, you should set up tracing, which also allows you to run and debug end-to-end evaluations on Confident AI.
In this example, we'll be using the following LLM application, which has a simple, "flat" RAG architecture, to demonstrate how to run end-to-end evaluations on it using `deepeval`:
from typing import List
from openai import OpenAI

client = OpenAI()

def your_llm_app(input: str):
    def retriever(input: str):
        return ["Hardcoded text chunks from your vector database"]

    def generator(input: str, retrieved_chunks: List[str]):
        res = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Use the provided context to answer the question."},
                {"role": "user", "content": "\n\n".join(retrieved_chunks) + "\n\nQuestion: " + input}
            ]
        ).choices[0].message.content
        return res

    retrieval_context = retriever(input)
    return generator(input, retrieval_context), retrieval_context

print(your_llm_app("How are you?"))
Run End-to-End Evals
Running an end-to-end LLM evaluation creates a test run — a collection of test cases that benchmarks your LLM application at a specific point in time. You would typically:
- Loop through a list of `Golden`s
- Invoke your LLM app with each golden's `input`
- Generate a set of test cases ready for evaluation
You can run end-to-end LLM evaluations in either:

- Python scripts, using the `evaluate()` function, or
- CI/CD pipelines, using `deepeval test run`

Both give you exactly the same functionality, and both integrate 100% with Confident AI for sharable testing reports on the cloud.
Use `evaluate()` in Python scripts
`deepeval` offers an `evaluate()` function that allows you to evaluate end-to-end LLM interactions through a list of test cases and metrics. Each test case will be evaluated by every metric you define in `metrics`, and a test case passes only if all metrics pass.
Python:
from somewhere import your_llm_app  # Replace with your LLM app

from deepeval.dataset import EvaluationDataset, Golden
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate

dataset = EvaluationDataset(goldens=[Golden(input="...")])

# Create test cases from goldens
for golden in dataset.goldens:
    res, text_chunks = your_llm_app(golden.input)
    test_case = LLMTestCase(input=golden.input, actual_output=res, retrieval_context=text_chunks)
    dataset.add_test_case(test_case)

# Evaluate end-to-end
evaluate(test_cases=dataset.test_cases, metrics=[AnswerRelevancyMetric()])
OpenAI:

from deepeval.openai import OpenAI  # import OpenAI from deepeval instead
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

dataset = EvaluationDataset(goldens=[Golden(input="...")])
client = OpenAI()

# Loop through dataset
for golden in dataset.evals_iterator():
    client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": golden.input}
        ],
        metrics=[AnswerRelevancyMetric()]
    )
There are TWO mandatory and SIX optional parameters when calling the `evaluate()` function for END-TO-END evaluation:

- `test_cases`: a list of `LLMTestCase`s OR `ConversationalTestCase`s, or an `EvaluationDataset`. You cannot evaluate `LLMTestCase`/`MLLMTestCase`s and `ConversationalTestCase`s in the same test run.
- `metrics`: a list of metrics of type `BaseMetric`.
- [Optional] `hyperparameters`: a dict of type `dict[str, Union[str, int, float]]`. You can log any arbitrary hyperparameter associated with this test run to pick the best hyperparameters for your LLM application on Confident AI.
- [Optional] `identifier`: a string that allows you to better identify your test run on Confident AI.
- [Optional] `async_config`: an instance of type `AsyncConfig` that allows you to customize the degree of concurrency during evaluation. Defaulted to the default `AsyncConfig` values.
- [Optional] `display_config`: an instance of type `DisplayConfig` that allows you to customize what is displayed to the console during evaluation. Defaulted to the default `DisplayConfig` values.
- [Optional] `error_config`: an instance of type `ErrorConfig` that allows you to customize how to handle errors during evaluation. Defaulted to the default `ErrorConfig` values.
- [Optional] `cache_config`: an instance of type `CacheConfig` that allows you to customize the caching behavior during evaluation. Defaulted to the default `CacheConfig` values.
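For example, you might log the model used for this test run through `hyperparameters` and label the run with an `identifier`. A minimal sketch, reusing the `dataset` from the example above (the hyperparameter keys and values are illustrative):

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric

# `hyperparameters` accepts any dict[str, Union[str, int, float]] you want
# associated with this test run; the keys below are hypothetical examples
evaluate(
    test_cases=dataset.test_cases,
    metrics=[AnswerRelevancyMetric()],
    hyperparameters={"model": "gpt-4o", "prompt version": "v1"},
    identifier="end-to-end run #1",
)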
This is exactly the same as `assert_test()` in `deepeval test run`, but in a different interface.
Use `deepeval test run` in CI/CD pipelines
from somewhere import your_llm_app  # Replace with your LLM app

import pytest
from deepeval.dataset import Golden
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import assert_test

goldens = [Golden(input="...")]

# Loop through goldens using pytest
@pytest.mark.parametrize("golden", goldens)
def test_llm_app(golden: Golden):
    res, text_chunks = your_llm_app(golden.input)
    test_case = LLMTestCase(input=golden.input, actual_output=res, retrieval_context=text_chunks)
    assert_test(test_case=test_case, metrics=[AnswerRelevancyMetric()])
Then, run the following command in your CLI:
deepeval test run test_llm_app.py
There are TWO mandatory and ONE optional parameter when calling the `assert_test()` function for END-TO-END evaluation:

- `test_case`: an `LLMTestCase`.
- `metrics`: a list of metrics of type `BaseMetric`.
- [Optional] `run_async`: a boolean which, when set to `True`, enables concurrent evaluation of all metrics. Defaulted to `True`.
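For example, to evaluate metrics sequentially instead of concurrently (useful when debugging), you could pass `run_async=False` in the final line of `test_llm_app()` above; this is a sketch of just that call:

# Inside test_llm_app() from the example above; run_async=False disables
# concurrent metric evaluation for this assertion
assert_test(
    test_case=test_case,
    metrics=[AnswerRelevancyMetric()],
    run_async=False,
)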
Click here to learn about the different optional flags available to `deepeval test run` to customize asynchronous behaviors, error handling, etc.
The usual `pytest` command would still work, but it is not recommended. `deepeval test run` adds a range of functionalities on top of Pytest for unit-testing LLMs, enabled by 8+ optional flags. Users typically include `deepeval test run` as a command in their `.yaml` files for pre-deployment checks in CI/CD pipelines (example here).