Component-Level
Component-level evaluation assesses individual units of LLM interaction between internal components, such as retrievers, tool calls, LLM generations, or even agents interacting with other agents, rather than treating the LLM app as a black box.
In end-to-end evaluation, your LLM application is treated as a black box, and evaluation is encapsulated by the overall system inputs and outputs in the form of an `LLMTestCase`.

If your application has nested components or a structure that a simple `LLMTestCase` can't easily handle, component-level evaluation allows you to apply different metrics to different components in your LLM application. You would still be creating `LLMTestCase`s, but this time for individual components at runtime instead of the overall system.
Common use cases that are suitable for component-level evaluation include (non-exhaustive):
- Chatbots/conversational agents
- Autonomous agents
- Text-to-SQL
- Code generation
- etc.
The trend you'll notice is that use cases with more complex architectures are better suited for component-level evaluation.
Prerequisites
Select metrics
Unlike end-to-end evaluation, you will need to select a set of appropriate metrics for each component you want to evaluate, and ensure the `LLMTestCase`s you create in that component contain all the necessary parameters.
You should first read the metrics section to understand which metrics are suitable for which components; alternatively, you can join our Discord to ask us directly.
In component-level evaluation, there are more metrics to select as there are more individual components to evaluate.
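For example, in a typical RAG pipeline the retriever and the generator usually call for different metrics. The pairing below is only an illustrative sketch (the specific metric choices are assumptions, not a prescription):

```python
from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualRelevancyMetric,
    FaithfulnessMetric,
)

# Illustrative component-to-metric pairing for a RAG pipeline
retriever_metrics = [ContextualRelevancyMetric()]  # are the retrieved chunks relevant to the input?
generator_metrics = [AnswerRelevancyMetric(), FaithfulnessMetric()]  # is the answer relevant and grounded?
```

Each list would then be supplied to the `metrics` parameter of the corresponding `@observe` decorator, as shown in the next section.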
Set up your LLM application
Unlike end-to-end evaluation, where setting up your LLM application requires rewriting parts of your code to return certain variables for testing, component-level testing is as simple as adding an `@observe` decorator to apply different metrics at different component scopes.

The process of adding the `@observe` decorator to your app is known as tracing, which we will learn how to set up fully in the next section.

If you're worried about how tracing via `@observe` can affect your application, click here.
An `@observe` decorator creates a span, and the overall collection of spans is called a trace. We'll trace this example LLM application to demonstrate how to run component-level evaluations using `deepeval` in two lines of code:
```python
from typing import List
import openai

from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric


def your_llm_app(input: str):
    def retriever(input: str):
        return ["Hardcoded text chunks from your vector database"]

    @observe(metrics=[AnswerRelevancyMetric()])
    def generator(input: str, retrieved_chunks: List[str]):
        # NOTE: uses the legacy (pre-1.0) OpenAI SDK interface; swap in your own LLM call
        res = openai.ChatCompletion.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Use the provided context to answer the question."},
                {"role": "user", "content": "\n\n".join(retrieved_chunks) + "\n\nQuestion: " + input}
            ]
        ).choices[0].message["content"]

        # Create test case at runtime
        update_current_span(test_case=LLMTestCase(input=input, actual_output=res))
        return res

    return generator(input, retriever(input))


print(your_llm_app("How are you?"))
```
If you compare this implementation to the previous one in end-to-end evaluation, you'll notice that tracing with `deepeval`'s `@observe` means we don't have to return variables such as the `retrieval_context` in awkward places just to create end-to-end `LLMTestCase`s.
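Since each `@observe` decorator accepts its own `metrics`, you could also evaluate the retriever in isolation. The sketch below extends the example above; the choice of `ContextualRelevancyMetric` and the exact test case fields are illustrative assumptions rather than the only valid setup:

```python
from typing import List

from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase
from deepeval.metrics import ContextualRelevancyMetric


@observe(metrics=[ContextualRelevancyMetric()])
def retriever(input: str) -> List[str]:
    chunks = ["Hardcoded text chunks from your vector database"]
    # Create a test case scoped to this retrieval span at runtime
    update_current_span(
        test_case=LLMTestCase(
            input=input,
            actual_output="\n\n".join(chunks),  # placeholder output for this span
            retrieval_context=chunks,
        )
    )
    return chunks
```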
At this point, you can either pause and learn how to set up LLM tracing in the next section before continuing, or finish this section before moving on to tracing.
Run Component-Level Evals
Once your LLM application is decorated with `@observe`, you'll be able to provide it as an `observed_callback` and invoke it with `Golden`s to create a list of test cases within your `@observe` decorated spans. These test cases are then evaluated using the respective `metrics` to create a test run.
You can run component-level LLM evaluations in either:

- CI/CD pipelines using `deepeval test run`, or
- Python scripts using the `evaluate()` function
Both give you exactly the same functionality, and both integrate 100% with Confident AI for shareable testing reports on the cloud.
Use `evaluate()` in Python scripts
To use `evaluate()` for component-level testing, supply a list of `Golden`s instead of `LLMTestCase`s, and an `observed_callback`, which is the `@observe` decorated LLM application you wish to run evals on.
```python
from somewhere import your_llm_app  # Replace with your LLM app

from deepeval.dataset import Golden
from deepeval import evaluate

# Goldens from your dataset
goldens = [Golden(input="...")]

# Evaluate with `observed_callback`
evaluate(goldens=goldens, observed_callback=your_llm_app)
```
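In practice you will usually have more than one golden, often stored in a dataset. A minimal sketch, assuming a dataset already exists on Confident AI (the "My Dataset" alias is hypothetical):

```python
from somewhere import your_llm_app  # Replace with your LLM app

from deepeval.dataset import EvaluationDataset
from deepeval import evaluate

# Assumption: a dataset with this alias has been created on Confident AI
dataset = EvaluationDataset()
dataset.pull(alias="My Dataset")

evaluate(goldens=dataset.goldens, observed_callback=your_llm_app)
```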
There are TWO mandatory and FIVE optional parameters when calling the `evaluate()` function for COMPONENT-LEVEL evaluation:
- `goldens`: a list of `Golden`s that you wish to invoke your `observed_callback` with.
- `observed_callback`: a function callback that is your `@observe` decorated LLM application. There must be AT LEAST ONE metric within one of the `metrics` in your `@observe` decorated LLM application.
- [Optional] `identifier`: a string that allows you to better identify your test run on Confident AI.
- [Optional] `async_config`: an instance of type `AsyncConfig` that allows you to customize the degree of concurrency during evaluation. Defaulted to the default `AsyncConfig` values.
- [Optional] `display_config`: an instance of type `DisplayConfig` that allows you to customize what is displayed to the console during evaluation. Defaulted to the default `DisplayConfig` values.
- [Optional] `error_config`: an instance of type `ErrorConfig` that allows you to customize how to handle errors during evaluation. Defaulted to the default `ErrorConfig` values.
- [Optional] `cache_config`: an instance of type `CacheConfig` that allows you to customize the caching behavior during evaluation. Defaulted to the default `CacheConfig` values.
You'll notice that unlike end-to-end evaluation, there is no declaration of `metrics`, because those are defined in the `metrics` parameter of `@observe`, and no creation of `LLMTestCase`s, because that is handled at runtime by `update_current_span` in your LLM app.
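For completeness, here is a hedged sketch of passing the optional parameters. It assumes the config classes are importable from `deepeval.evaluate.configs` and that `AsyncConfig` exposes a `max_concurrent` field and `ErrorConfig` an `ignore_errors` flag; double-check these names against your installed `deepeval` version:

```python
from somewhere import your_llm_app  # Replace with your LLM app

from deepeval.dataset import Golden
from deepeval import evaluate
from deepeval.evaluate.configs import AsyncConfig, ErrorConfig  # assumed import path

goldens = [Golden(input="...")]

evaluate(
    goldens=goldens,
    observed_callback=your_llm_app,
    identifier="component-level-run-1",           # hypothetical identifier
    async_config=AsyncConfig(max_concurrent=10),  # assumed field name
    error_config=ErrorConfig(ignore_errors=True), # assumed field name
)
```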
Use `deepeval test run` in CI/CD pipelines
`deepeval` allows you to run evaluations as if you're using Pytest via our Pytest integration.
```python
from somewhere import your_llm_app  # Replace with your LLM app

import pytest
from deepeval.dataset import Golden
from deepeval import assert_test

# Goldens from your dataset
goldens = [Golden(input="...")]

# Loop through goldens using pytest
@pytest.mark.parametrize("golden", goldens)
def test_llm_app(golden: Golden):
    assert_test(golden=golden, observed_callback=your_llm_app)
```
Similar to the `evaluate()` function, `assert_test()` for component-level evaluation does not need:
- Declaration of `metrics`, because those are defined at the span level in the `metrics` parameter.
- Creation of `LLMTestCase`s, because it is handled at runtime by `update_current_span` in your LLM app.
Finally, don't forget to run the test file in the CLI:
```bash
deepeval test run test_llm_app.py
```
There are TWO mandatory and ONE optional parameter when calling the `assert_test()` function for COMPONENT-LEVEL evaluation:
- `golden`: the `Golden` that you wish to invoke your `observed_callback` with.
- `observed_callback`: a function callback that is your `@observe` decorated LLM application. There must be AT LEAST ONE metric within one of the `metrics` in your `@observe` decorated LLM application.
- [Optional] `run_async`: a boolean which, when set to `True`, enables concurrent evaluation of all metrics in `@observe`. Defaulted to `True` (see the sketch below).
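For example, to evaluate the metrics in your spans sequentially instead of concurrently (which can make debugging individual spans easier), disable `run_async`. A minimal sketch reusing the test file above:

```python
from somewhere import your_llm_app  # Replace with your LLM app

import pytest
from deepeval.dataset import Golden
from deepeval import assert_test

goldens = [Golden(input="...")]


@pytest.mark.parametrize("golden", goldens)
def test_llm_app_sync(golden: Golden):
    # run_async=False evaluates all metrics in your @observe spans sequentially
    assert_test(golden=golden, observed_callback=your_llm_app, run_async=False)
```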
Click here to learn about the different optional flags available to `deepeval test run` to customize asynchronous behaviors, error handling, etc.