Running LLM-Evals
Quick Summary
Running an LLM evaluation creates a test run — a collection of test cases that benchmarks your LLM application at a specific point in time. Typically, you loop through a list of Goldens, invoke your LLM app with each golden's input, and generate a set of test cases ready for evaluation. Once the evaluation metrics have been applied to your test cases, you get a completed test run.
If you're logged into Confident AI, you'll also receive a fully sharable LLM testing report on the cloud.
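As a quick illustration, the typical loop looks something like this (a minimal sketch; my_llm_app is a hypothetical callable standing in for your LLM application):
from deepeval.dataset import Golden
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate

goldens = [Golden(input="What's the weather like in SF?")]

# Invoke your LLM app (here, a hypothetical `my_llm_app`) on each golden's input
test_cases = [
    LLMTestCase(input=golden.input, actual_output=my_llm_app(golden.input))
    for golden in goldens
]

# Applying metrics to the test cases produces a completed test run
evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric()])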
There are FOUR ways you can run LLM-evals in deepeval:
- Using deepeval test run in CI/CD pipelines:
  - With tracing
  - Without tracing
- Using the evaluate() function:
  - With tracing
  - Without tracing
For simple LLM applications like basic RAG pipelines with "flat" architectures, tracing might be overkill. However, if your application already has nested components or a structure that a simple LLMTestCase can't easily handle, we recommend setting up tracing in deepeval to apply different metrics to different components in your LLM application.
Setting up tracing lets you evaluate different parts of your app without needing to rewrite your codebase or manually pass intermediate variables just to create LLMTestCases, and it solves these common issues you may have already encountered:
- Manual code changes: You often need to expose or modify internal variables across many layers just to capture outputs for evaluation.
- Limited visibility: It's hard to measure individual components (e.g., retrieval, re-ranking, reasoning) without tightly coupling evaluation logic into your code, and you might end up indexing evaluation results by the name of the component you wish to unit-test.
Tracing in deepeval also does not affect production code, unless you wish to also run online evaluations in real-time and monitor your LLM app on Confident AI.
Setup Tracing (highly recommended)
deepeval offers an @observe decorator for you to apply metrics at any point in your LLM app to evaluate any LLM interaction, no matter how complex it may be, and we recommend everyone set it up. Tracing in deepeval has these benefits:
- Apply metrics flexibly across components: Tracing lets you attach LLMTestCases at any level—whether system-wide or deep inside a nested flow—so you can run targeted metrics on specific components without restructuring your code.
- Does not affect production code: If you're worried that tracing will affect your LLM calls in production, it won't. This is because you can simply disable all @observe functionality in deepeval.
- Evaluate complex systems easily: Modern LLM apps span retrieval, tool use, and orchestration across different parts of your system. Tracing captures outputs from anywhere in your code without manual wiring.
- Quick and seamless integration:
  - No major code changes required
  - No new concepts to learn
  - Setup takes about 3 minutes
Creating an LLMTestCase for a nested component through tracing is as simple as using the @observe decorator:
import openai

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span_test_case

@observe(metrics=[AnswerRelevancyMetric()])
def complete(query: str):
    # Call your LLM provider (here, OpenAI's chat completions API)
    response = openai.OpenAI().chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": query}]
    ).choices[0].message.content
    # Create an LLMTestCase for this span so the metric can be applied
    update_current_span_test_case(
        test_case=LLMTestCase(input=query, actual_output=response)
    )
    return response
Each metric in metrics is evaluated using exactly the same algorithm, requires the same LLMTestCase parameters, and uses the same configurations as you would expect when running evaluations without tracing.
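For reference, this is the same AnswerRelevancyMetric you could run on a standalone LLMTestCase outside of tracing; a minimal sketch (the threshold and example strings are just illustrative):
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

test_case = LLMTestCase(
    input="What's the weather like in SF?",
    actual_output="It is currently sunny and around 18°C in San Francisco.",
)

# Same algorithm and same required LLMTestCase parameters as when run via tracing
metric = AnswerRelevancyMetric(threshold=0.7)
metric.measure(test_case)
print(metric.score, metric.reason)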
Terminologies
There are two terminologies you need to know before setting up tracing:
- Trace: The overall execution flow of your LLM application
- Span: Individual components or units of work within your application (e.g., LLM calls, tool executions, retrievals)
A span can contain many child spans, forming a tree structure—just like how different components of your LLM application interact. As you'll see in the next section, you apply metrics at the span level to evaluate specific components, because each span represents a component of your application.
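As a minimal sketch (with hypothetical function names), nesting @observe decorated functions is all it takes to form that tree:
from deepeval.tracing import observe

@observe(name="Retriever")
def retrieve(query: str) -> list:
    # Child span: an individual unit of work within the trace
    return ["some retrieved chunk"]

@observe(name="My LLM App")
def my_llm_app(query: str) -> str:
    # Parent span: calling retrieve() here creates a child span beneath it,
    # and invoking my_llm_app() itself creates the trace
    docs = retrieve(query)
    return f"Answer based on {len(docs)} documents for: {query}"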
Using the @observe Decorator
The @observe decorator creates spans, and a call to your LLM application decorated by the @observe decorator creates a trace with many spans. This is how you would use @observe:
import openai

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import observe, update_current_span_test_case

@observe(metrics=[AnswerRelevancyMetric()])
def complete(query: str):
    # Call your LLM provider (here, OpenAI's chat completions API)
    response = openai.OpenAI().chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": query}]
    ).choices[0].message.content
    # Create an LLMTestCase for this span so the metric can be applied
    update_current_span_test_case(
        test_case=LLMTestCase(input=query, actual_output=response)
    )
    return response
There are ZERO mandatory and THREE optional parameters when using the @observe decorator:
- [Optional] type: The type of span. Anything other than "llm", "retriever", "tool", and "agent" is a custom span type.
- [Optional] name: A string specifying how this custom span is displayed on Confident AI. Defaulted to the name of the decorated function.
- [Optional] metrics: A list of metrics of type BaseMetric that you wish to run upon tracing in deepeval. Defaulted to None.
Although the metrics parameter is optional, to run an evaluation you MUST:
- Supply a list of metrics
- Call update_current_span_test_case to create an LLMTestCase to evaluate the LLM interaction in the current span
If you simply decorate your LLM application with @observe and don't supply any arguments, nothing will happen at all. The metrics parameter is optional because some users might want to use tracing only for the debugging UI on Confident AI, and not necessarily run evaluations on individual components.
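For example, a bare decoration like this (a minimal sketch) traces the component for Confident AI's debugging UI but evaluates nothing:
from deepeval.tracing import observe

@observe()  # no arguments: the span is traced, but no metrics are run
def summarize(text: str) -> str:
    return "A summary of: " + text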
For simplicity, we always recommend custom spans unless needed otherwise, since metrics only care about the scope of the span, and supplying a specified type is most useful only when using Confident AI. To summarize (see the sketch after this list):
- Specifying a span type (like "llm") allows you to supply additional parameters in the @observe signature (e.g., the model used).
- This information becomes extremely useful for analysis and visualization if you're using deepeval together with Confident AI (highly recommended).
- Otherwise, for local evaluation purposes, span type makes no difference — evaluation still works the same way.
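As a sketch of the difference (assuming, per the Confident AI tracing docs, that "llm" spans accept extra keyword arguments such as model):
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import observe

# An "llm" span can carry extra metadata, e.g. the model used
@observe(type="llm", model="gpt-4o", metrics=[AnswerRelevancyMetric()])
def generate(prompt: str) -> str:
    ...

# A custom span only needs a name; local evaluation works exactly the same way
@observe(name="My Component", metrics=[AnswerRelevancyMetric()])
def my_component(prompt: str) -> str:
    ...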
To learn more about the different span types, or to run LLM evaluations with tracing with a UI for visualization and debugging, visit the official Confident AI docs on LLM tracing.
Full Example
In this example, we're going to evaluate the "RAG Pipeline" component in our "Research Agent" using the ContextualRelevancyMetric by setting up tracing in deepeval with the @observe decorator:
This is the same example we used in the test cases section.
Assuming the code implementation of this LLM agent, the code block below shows it only took an additional SEVEN LINES OF CODE to set up tracing:
from typing import List

from deepeval.test_case import LLMTestCase
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.tracing import observe, update_current_span_test_case

def web_search(query: str) -> str:
    # <--Include implementation to search web here-->
    return "Latest search results for: " + query

def retrieve_documents(query: str) -> List[str]:
    # <--Include implementation to fetch from vector database here-->
    return ["Document 1: This is relevant information about the query."]

def generate_response(input: str) -> str:
    # <--Include format prompts and call your LLM provider here-->
    return "Generated response based on the prompt: " + input

@observe(
    type="custom", name="RAG Pipeline", metrics=[ContextualRelevancyMetric()]
)
def rag_pipeline(query: str) -> str:
    # Calls retriever and llm
    docs = retrieve_documents(query)
    context = "\n".join(docs)
    response = generate_response(f"Context: {context}\nQuery: {query}")
    update_current_span_test_case(
        test_case=LLMTestCase(input=query, actual_output=response, retrieval_context=docs)
    )
    return response

@observe(type="agent")
def research_agent(query: str) -> str:
    # Calls RAG pipeline
    initial_response = rag_pipeline(query)
    # Use web search tool on the results
    search_results = web_search(initial_response)
    # Generate final response incorporating both RAG and search results
    final_response = generate_response(
        f"Initial response: {initial_response}\n"
        f"Additional search results: {search_results}\n"
        f"Query: {query}"
    )
    return final_response
Then, simply use the evaluate() function (or assert_test() with deepeval test run):
from deepeval.dataset import Golden
from deepeval import evaluate
...

# Create a golden instead of a test case
golden = Golden(input="What's the weather like in SF?")

# Run evaluation
evaluate(goldens=[golden], traceable_callback=research_agent)
Notice that without tracing, creating evaluation-ready LLMTestCases is complicated because you have to bubble the input and returned output values for your "RAG Pipeline" component up to the surface for evaluation.
Using deepeval test run in CI/CD Pipelines
deepeval allows you to run evaluations as if you're using Pytest via our Pytest integration. Instead of running the usual pytest test_file.py command, you would instead use deepeval test run test_file.py, which creates a test run - a collection of evaluated test cases.
deepeval test run test_llm.py
This command adds a range of functionalities on top of Pytest for unit-testing LLMs, enabled by 8+ optional flags. You can learn about all the different flags deepeval test run offers in more detail below. Users typically include deepeval test run as a command in their .yaml files for pre-deployment checks in CI/CD pipelines (example here).
With Tracing
If you haven't already, set up tracing using the guide above, and use the assert_test() function inside a test_file.py file:
import pytest

from deepeval.dataset import Golden
from deepeval import assert_test
...

goldens = [Golden(input="...")]

@pytest.mark.parametrize(
    "golden",
    goldens,
)
def test_llm_app(golden: Golden):
    # Replace 'research_agent' with your own @observe decorated LLM app
    assert_test(golden=golden, traceable_callback=research_agent)
Two things to note:
- @pytest.mark.parametrize allows you to loop through a list of objects in Pytest.
- There is no declaration of metrics because those are defined at the span level in the metrics parameter.
Finally, don't forget to run the test file in the CLI:
deepeval test run test_llm_app.py
There are TWO mandatory and ONE optional parameter when calling the assert_test() function WITH tracing:
- golden: the Golden that you wish to invoke your traceable_callback with.
- traceable_callback: a function callback that is your @observe decorated LLM application. There must be AT LEAST ONE metric within one of the metrics in your @observe decorated LLM application.
- [Optional] run_async: a boolean which when set to True, enables concurrent evaluation of all metrics in @observe. Defaulted to True.
When assert_test() is called, your traceable_callback is first invoked with the provided golden, which runs your LLM application and creates a list of test cases within your @observe decorated spans. These test cases are then evaluated using their respective metrics as you would normally expect. An execution of assert_test() in this case passes only if all @observe decorated metrics pass.
Without Tracing
If your LLM application has a simpler architecture, you can use deepeval test run without tracing:
import pytest

from deepeval.dataset import Golden
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import assert_test
...

goldens = [Golden(input="...")]

@pytest.mark.parametrize(
    "golden",
    goldens,
)
def test_llm_app(golden: Golden):
    # Replace 'research_agent' with your LLM app
    test_case = LLMTestCase(input=golden.input, actual_output=research_agent(golden.input))
    assert_test(test_case=test_case, metrics=[AnswerRelevancyMetric()])
deepeval test run test_llm_app.py
There are TWO mandatory and ONE optional parameter when calling the assert_test() function WITHOUT tracing:
- test_case: an LLMTestCase.
- metrics: a list of metrics of type BaseMetric.
- [Optional] run_async: a boolean which when set to True, enables concurrent evaluation of all metrics. Defaulted to True.
If you're logged into Confident AI, you'll also receive a fully sharable LLM testing report on the cloud. Run this in the CLI:
deepeval login
Flags for deepeval test run:
Parallelization
Evaluate each test case in parallel by providing a number to the -n flag to specify how many processes to use.
deepeval test run test_example.py -n 4
Cache
Provide the -c flag (with no arguments) to read from the local deepeval cache instead of re-evaluating test cases on the same metrics.
deepeval test run test_example.py -c
This is extremely useful if you're running large amounts of test cases. For example, let's say you're running 1000 test cases using deepeval test run, but you encounter an error on the 999th test case. The cache functionality would allow you to skip the 998 test cases that have already been evaluated and only evaluate the remaining ones.
Ignore Errors
The -i flag (with no arguments) allows you to ignore errors for metrics executions during a test run. An example of where this is helpful is if you're using a custom LLM and often find it generating invalid JSONs that will stop the execution of the entire test run.
deepeval test run test_example.py -i
You can combine different flags, such as the -i, -c, and -n flags, to execute any uncached test cases in parallel while ignoring any errors along the way:
deepeval test run test_example.py -i -c -n 2
Verbose Mode
The -v flag (with no arguments) allows you to turn on verbose_mode for all metrics run using deepeval test run. Not supplying the -v flag will default each metric's verbose_mode to its value at instantiation.
deepeval test run test_example.py -v
When a metric's verbose_mode is True, it prints the intermediate steps used to calculate said metric to the console during evaluation.
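For instance, without the -v flag each metric keeps whatever you set at instantiation; a minimal sketch:
from deepeval.metrics import AnswerRelevancyMetric

# Prints this metric's intermediate evaluation steps to the console,
# unless `deepeval test run -v` overrides verbose_mode for all metrics
metric = AnswerRelevancyMetric(verbose_mode=True)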
Skip Test Cases
The -s flag (with no arguments) allows you to skip metric executions where the test case has missing/insufficient parameters (such as retrieval_context) that are required for evaluation. An example of where this is helpful is if you're using a metric such as the ContextualPrecisionMetric but don't want to apply it when the retrieval_context is None.
deepeval test run test_example.py -s
Identifier
The -id flag followed by a string allows you to name test runs and better identify them on Confident AI. An example of where this is helpful is if you're running automated deployment pipelines, have deployment IDs, or just want a way to identify which test run is which for comparison purposes.
deepeval test run test_example.py -id "My Latest Test Run"
Display Mode
The -d flag followed by a string of "all", "passing", or "failing" allows you to display only certain test cases in the terminal. For example, you can supply "failing" if you only care about the failing test cases.
deepeval test run test_example.py -d "failing"
Repeats
Repeat each test case by providing a number to the -r flag to specify how many times to rerun each test case.
deepeval test run test_example.py -r 2
Hooks
deepeval's Pytest integration allows you to run custom code at the end of each evaluation via the @deepeval.on_test_run_end decorator:
...
@deepeval.on_test_run_end
def function_to_be_called_after_test_run():
print("Test finished!")
Using the evaluate() Function
deepeval also offers an evaluate() function as an alternative to deepeval test run, but without the need for Pytest or the CLI.
With Tracing
If you haven't already, learn how to set up tracing to understand this example:
from deepeval.dataset import Golden
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate
...
# Goldens from your dataset
goldens = [Golden(input="...")]
# Evaluate with `traceable_callback`
evaluate(goldens=goldens, traceable_callback=research_agent)
There are TWO mandatory and FIVE optional parameters when calling the evaluate() function:
- goldens: a list of Goldens that you wish to invoke your traceable_callback with.
- traceable_callback: a function callback that is your @observe decorated LLM application. There must be AT LEAST ONE metric within one of the metrics in your @observe decorated LLM application.
- [Optional] identifier: a string that allows you to better identify your test run on Confident AI.
- [Optional] async_config: an instance of type AsyncConfig that allows you to customize the degree of concurrency during evaluation. Defaulted to the default AsyncConfig values.
- [Optional] display_config: an instance of type DisplayConfig that allows you to customize what is displayed to the console during evaluation. Defaulted to the default DisplayConfig values.
- [Optional] error_config: an instance of type ErrorConfig that allows you to customize how to handle errors during evaluation. Defaulted to the default ErrorConfig values.
- [Optional] cache_config: an instance of type CacheConfig that allows you to customize the caching behavior during evaluation. Defaulted to the default CacheConfig values.
This is exactly the same as assert_test() in deepeval test run, but in a different interface. A test case passes only if all metrics within each @observe decorated function pass.
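As a sketch of how the optional parameters above fit together (the identifier string and concurrency value are purely illustrative):
from deepeval.dataset import Golden
from deepeval.evaluate import AsyncConfig
from deepeval import evaluate

goldens = [Golden(input="...")]

evaluate(
    goldens=goldens,
    traceable_callback=research_agent,  # your @observe decorated LLM app
    identifier="pre-deployment check",
    async_config=AsyncConfig(max_concurrent=5),
)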
Without Tracing
from deepeval.dataset import Golden
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate
...

goldens = [Golden(input="...")]

# Create test cases from goldens
test_cases = []
for golden in goldens:
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=research_agent(golden.input),
    )
    test_cases.append(test_case)

# Evaluate without tracing
evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric()])
There are TWO mandatory and SIX optional parameters when calling the evaluate() function:
- test_cases: a list of LLMTestCases OR ConversationalTestCases, or an EvaluationDataset. You cannot evaluate LLMTestCases/MLLMTestCases and ConversationalTestCases in the same test run.
- metrics: a list of metrics of type BaseMetric.
- [Optional] hyperparameters: a dict of type dict[str, Union[str, int, float]]. You can log any arbitrary hyperparameter associated with this test run to pick the best hyperparameters for your LLM application on Confident AI (see the sketch after this list).
- [Optional] identifier: a string that allows you to better identify your test run on Confident AI.
- [Optional] async_config: an instance of type AsyncConfig that allows you to customize the degree of concurrency during evaluation. Defaulted to the default AsyncConfig values.
- [Optional] display_config: an instance of type DisplayConfig that allows you to customize what is displayed to the console during evaluation. Defaulted to the default DisplayConfig values.
- [Optional] error_config: an instance of type ErrorConfig that allows you to customize how to handle errors during evaluation. Defaulted to the default ErrorConfig values.
- [Optional] cache_config: an instance of type CacheConfig that allows you to customize the caching behavior during evaluation. Defaulted to the default CacheConfig values.
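For example, a minimal sketch of logging hyperparameters alongside a test run (the keys and values here are purely illustrative):
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate

evaluate(
    test_cases=test_cases,
    metrics=[AnswerRelevancyMetric()],
    # Arbitrary hyperparameters logged with this test run for comparison on Confident AI
    hyperparameters={"model": "gpt-4o", "prompt version": "v2", "temperature": 0.7},
)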
This is exactly the same as assert_test() in deepeval test run, but in a different interface. A test case passes only if all metrics for each test case pass.
Configs for evaluate()
Unlike deepeval test run where flags are used, behaviors for the evaluate() function are controlled by "configs" (short for configurations), because there are simply too many of them to include as individual parameters.
Async Configs
The AsyncConfig controls how concurrently metrics, traceable_callback, and test_cases will be evaluated during evaluate().
from deepeval.evaluate import AsyncConfig
from deepeval import evaluate
evaluate(async_config=AsyncConfig(), ...)
There are THREE optional parameters when creating an AsyncConfig:
- [Optional] run_async: a boolean which when set to True, enables concurrent evaluation of test cases AND metrics. Defaulted to True.
- [Optional] throttle_value: an integer that determines how long (in seconds) to throttle the evaluation of each test case. You can increase this value if your evaluation model is running into rate limit errors. Defaulted to 0.
- [Optional] max_concurrent: an integer that determines the maximum number of test cases that can be run in parallel at any point in time. You can decrease this value if your evaluation model is running into rate limit errors. Defaulted to 20.
The throttle_value and max_concurrent parameters are only used when run_async is set to True. A combination of throttle_value and max_concurrent is the best way to handle rate limiting errors, either in your LLM judge or LLM application, when running evaluations.
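For example, a minimal sketch of dialing concurrency down when you're hitting rate limits (the specific values are just illustrative):
from deepeval.evaluate import AsyncConfig
from deepeval import evaluate

evaluate(
    # Wait 2 seconds between test cases and run at most 5 test cases in parallel
    async_config=AsyncConfig(throttle_value=2, max_concurrent=5),
    ...
)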
Display Configs
The DisplayConfig controls how results and intermediate execution steps are displayed during evaluate().
from deepeval.evaluate import DisplayConfig
from deepeval import evaluate
evaluate(display_config=DisplayConfig(), ...)
There are FOUR optional parameters when creating a DisplayConfig:
- [Optional] verbose_mode: an optional boolean which, when NOT None, overrides each metric's verbose_mode value. Defaulted to None.
- [Optional] display: a str of either "all", "failing" or "passing", which allows you to selectively decide which type of test cases to display as the final result. Defaulted to "all".
- [Optional] show_indicator: a boolean which when set to True, shows the evaluation progress indicator for each individual metric. Defaulted to True.
- [Optional] print_results: a boolean which when set to True, prints the result of each evaluation. Defaulted to True.
Error Configs
The ErrorConfig controls how errors are handled in evaluate().
from deepeval.evaluate import ErrorConfig
from deepeval import evaluate
evaluate(error_config=ErrorConfig(), ...)
There are TWO optional parameters when creating an ErrorConfig:
- [Optional] skip_on_missing_params: a boolean which when set to True, skips all metric executions for test cases with missing parameters. Defaulted to False.
- [Optional] ignore_errors: a boolean which when set to True, ignores all exceptions raised during metrics execution for each test case. Defaulted to False.
If both skip_on_missing_params and ignore_errors are set to True, skip_on_missing_params takes precedence. This means that if a metric is missing required test case parameters, it will be skipped (and the result will be missing) rather than appearing as an ignored error in the final test run.
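A minimal sketch of the most forgiving setup, where missing parameters are skipped first and any remaining exceptions are ignored:
from deepeval.evaluate import ErrorConfig
from deepeval import evaluate

evaluate(
    # skip_on_missing_params takes precedence over ignore_errors
    error_config=ErrorConfig(skip_on_missing_params=True, ignore_errors=True),
    ...
)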
Cache Configs
The CacheConfig controls the caching behavior of evaluate().
from deepeval.evaluate import CacheConfig
from deepeval import evaluate
evaluate(cache_config=CacheConfig(), ...)
There are TWO optional parameters when creating a CacheConfig:
- [Optional] use_cache: a boolean which when set to True, uses cached test run results instead. Defaulted to False.
- [Optional] write_cache: a boolean which when set to True, writes test run results to DISK. Defaulted to True.
The write_cache parameter writes to disk, so you should disable it if that is causing any errors in your environment.
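For example, a minimal sketch that reuses cached results while disabling disk writes:
from deepeval.evaluate import CacheConfig
from deepeval import evaluate

evaluate(
    # Read previously cached results and avoid writing the cache to disk
    cache_config=CacheConfig(use_cache=True, write_cache=False),
    ...
)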