LLM Tracing
Tracing your LLM application helps you monitor its full execution from start to finish. With deepeval
's @observe
decorator, you can trace and evaluate any LLM interaction at any point in your app no matter how complex they may be.
Quick Summary
An LLM trace is made up of multiple individual spans. A span is a flexible, user-defined scope for evaluation or debugging. A full trace of your application contains one or more spans.
Tracing allows you run both end-to-end and component level evals which you'll learn about in the later sections.
Learn how DeepEval's tracing is non-instrusive
deepeval
's tracing is non-intrusive, it requires minimal code change and doesn't add latency to your LLM application. It also:
Uses concepts you already know: Tracing a component in your LLM app takes on average 3 lines of code, which uses the same
LLMTestCase
s and metrics that you're already familiar with.Does not affect production code: If you're worried that tracing will affect your LLM calls in production, it won't. This is because the
@observe
decorators that you add for tracing is only invoked if called explicitly during evaluation.Non-opinionated:
deepeval
does not care what you consider a "component" - in fact a component can be anything, at any scope, as long as you're able to set yourLLMTestCase
within that scope for evaluation.
Tracing only runs when you want it to run, and takes 3 lines of code:
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import observe, update_current_span
from openai import OpenAI
client = OpenAI()
@observe(metrics=[AnswerRelevancyMetric()])
def get_res(query: str):
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": query}]
).choices[0].message.content
update_current_span(input=query, output=response)
return response
Why Tracing?
Tracing your LLM applications allows you to:
Generate test cases dynamically: Many components rely on upstream outputs. Tracing lets you define
LLMTestCase
s at runtime as data flows through the system.Debug with precision: See exactly where and why things fail—whether it’s tool calls, intermediate outputs, or context retrieval steps.
Run targeted metrics on specific components: Attach
LLMTestCase
s to agents, tools, retrievers, or LLMs and apply metrics like answer relevancy or context precision—without needing to restructure your app.Run end-to-end evals with trace data: Use the
evals_iterator
withmetrics
to perform comprehensive evaluations using your traces.
Setup Tracing
To set up tracing in your LLM app, you need to understand two key concepts:
- Trace: The full execution of your app, made up of one or more spans.
- Span: A specific component or unit of work—like an LLM call, tool invocation, or document retrieval.
The @observe
decorator is the primary way to setup tracing for your LLM application. You need to find the individual components of your LLM application and decorate them with deepeval
's @observe
decorator.
Here's how you can setup tracing for you application in just a few steps:
Decorate your components
An individual function that makes up a part of your LLM application or is invoked only when necessary, can be classified as a component. You can decorate this component with deepeval
's @observe
decorator.
from openai import OpenAI
from deepeval.tracing import observe
client = OpenAI()
@observe()
def get_res(query: str):
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": query}]
).choices[0].message.content
return response
The above get_res()
component is treated as an individual span
within a trace
.
Add test cases inside components
You can assign individual test cases to a span
by using the update_current_span
function from deepeval
. This allows you to create separate LLMTestCase
s on a component level.
from openai import OpenAI
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase
client = OpenAI()
@observe()
def get_res(query: str):
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": query}]
).choices[0].message.content
update_current_span(input=query, output=response)
return response
You can either supply the LLMTestCase
or its parameters in the update_current_span
to create a component level test case. Learn more here.
Get your traces
You can now get your traces by simply calling your observed function or application.
query = "This will get you a trace."
get_res(query)
🎉🥳 Congratulations! You just created your first trace with deepeval
.
We highly recommend setting up Confident AI to look at your traces in an intuitive UI like this:
It's free to get started. Just the following command:
deepeval login
Observe
The @observe
decorator is a non-intrusive python decorator that you can use on top of any component as you wish. It tracks the usage of the component whenever it is evoked to create a span.
A span can contain many child spans, forming a tree structure—just like how different components of your LLM application interact
from deepeval.tracing import observe
@observe()
def generate(query: str) -> str:
context = retrieve(query)
# Your implementation
return f"Output for given {query} and {context}."
@observe()
def retrieve(query: str) -> str:
# Your implementation
return [f"Context for the given {query}"]
From the above example, an observed component generate
calling another observed component retrieve
create a nested span generate
with retrieve
inside it.
There are FOUR optional parameters when using the @observe
decorator:
- [Optional]
metric_collection
: The name of the metric collection you stored in the Confident AI platform. - [Optional]
metrics
: A list of metrics of typeBaseMetric
that will be used to evaluate your span. - [Optional]
name
: The function name or a string specifying how this span is displayed on Confident AI. - [Optional]
type
: A string specifying the type of span. The value can be any one ofllm
,retriever
,tool
, andagent
. Any other value is treated as a custom span type.
Click here to learn more about span types
For simplicity, we always recommend custom spans unless needed otherwise, since metrics
only care about the scope of the span, and supplying a specified type
is most useful only when using Confident AI. To summarize:
- Specifying a span type (like
"llm"
) allows you to supply additional parameters in the@observe
signature (e.g., themodel
used). - This information becomes extremely useful for analysis and visualization if you're using
deepeval
together with Confident AI (highly recommended). - Otherwise, for local evaluation purposes, span
type
makes no difference — evaluation still works the same way.
To learn more about the different spans type
s, or to run LLM evaluations with tracing with an UI for visualization and debugging, visiting the official Confident AI docs on LLM tracing.
Update Current Span
The update_current_span
method can be used to create a test case for the corresponding span. This is especially useful for doing component level evals or debugging your application.
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase
@observe()
def generate(query: str) -> str:
context = retrieve(query)
# Your implementation
res = f"Output for given {query} and {context}."
update_current_span(test_case=LLMTestCase(
input=query,
actual_output=res,
retrieval_context=context
))
return res
@observe()
def retrieve(query: str) -> str:
# Your implementation
context = [f"Context for the given {query}"]
update_current_span(input=query, retrieval_context=context)
return context
There are TWO ways to create test cases when using the update_current_span
function:
[Optional]
test_case
: Takes anLLMTestCase
to create a span level test case for that component.Or, You can also opt to give the values of
LLMTestCase
directly by using the following attributes:- [Optional]
input
- [Optional]
output
- [Optional]
retrieval_context
- [Optional]
context
- [Optional]
expected_output
- [Optional]
tools_called
- [Optional]
expected_tools
- [Optional]
metadata
- [Optional]
name
- [Optional]
Verbose Logs
If you run your @observe
decorated LLM application outside of evaluate()
or assert_test()
, you'll notice some logs appearing in your console. These are debug logs meant to help during development, especially if you're using Confident AI alongside deepeval
, to confirm that your LLM tracing is set up correctly.
If you're not using Confident AI, you can safely ignore these logs — they won't affect performance, introduce latency, or block any processes. To disable them completely, just set the following environment variables:
CONFIDENT_TRACE_VERBOSE="NO"
CONFIDENT_TRACE_FLUSH="NO"
If you are using Confident AI, it's still a good idea to keep these logs on during development, and then disable them in production once you've confirmed that tracing is working as expected.
Next Steps
Now that you have your traces, you can run either end-to-end or component level evals.