
LLM Tracing

Tracing your LLM application helps you monitor its full execution from start to finish. With deepeval's @observe decorator, you can trace and evaluate any LLM interaction at any point in your app, no matter how complex it is.

Quick Summary

An LLM trace is made up of multiple individual spans. A span is a flexible, user-defined scope for evaluation or debugging. A full trace of your application contains one or more spans.


Tracing allows you to run both end-to-end and component-level evals, which you'll learn about in later sections.

Learn how deepeval's tracing is non-intrusive

deepeval's tracing is non-intrusive: it requires minimal code changes and doesn't add latency to your LLM application. It also:

  • Uses concepts you already know: Tracing a component in your LLM app takes on average 3 lines of code and uses the same LLMTestCases and metrics you're already familiar with.

  • Does not affect production code: If you're worried that tracing will affect your LLM calls in production, it won't. This is because the @observe decorators that you add for tracing are only invoked when called explicitly during evaluation.

  • Non-opinionated: deepeval does not care what you consider a "component" - in fact, a component can be anything, at any scope, as long as you're able to set your LLMTestCase within that scope for evaluation.

Tracing only runs when you want it to run, and takes 3 lines of code:

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import observe, update_current_span
from openai import OpenAI

client = OpenAI()

@observe(metrics=[AnswerRelevancyMetric()])
def get_res(query: str):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message.content

    update_current_span(input=query, output=response)
    return response

Why Tracing?

Tracing your LLM applications allows you to:

  • Generate test cases dynamically: Many components rely on upstream outputs. Tracing lets you define LLMTestCases at runtime as data flows through the system.

  • Debug with precision: See exactly where and why things fail—whether it’s tool calls, intermediate outputs, or context retrieval steps.

  • Run targeted metrics on specific components: Attach LLMTestCases to agents, tools, retrievers, or LLMs and apply metrics like answer relevancy or context precision—without needing to restructure your app.

  • Run end-to-end evals with trace data: Use the evals_iterator with metrics to perform comprehensive evaluations using your traces.
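
For instance, here's a minimal sketch of an end-to-end eval over traces, reusing the get_res function from the snippet above (the Golden input is a placeholder; the metrics attached via @observe are what get evaluated against each trace):

from deepeval.dataset import EvaluationDataset, Golden

# Placeholder goldens; swap in your own inputs
dataset = EvaluationDataset(goldens=[Golden(input="What is deepeval?")])

# Each iteration invokes the traced app, and deepeval evaluates the
# resulting trace using the metrics attached via @observe
for golden in dataset.evals_iterator():
    get_res(golden.input)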

Setup Tracing

To set up tracing in your LLM app, you need to understand two key concepts:

  • Trace: The full execution of your app, made up of one or more spans.
  • Span: A specific component or unit of work—like an LLM call, tool invocation, or document retrieval.

The @observe decorator is the primary way to set up tracing for your LLM application. You need to find the individual components of your LLM application and decorate them with deepeval's @observe decorator.

Here's how you can set up tracing for your application in just a few steps:

Decorate your components

An individual function that makes up part of your LLM application, or is invoked only when necessary, can be classified as a component. You can decorate this component with deepeval's @observe decorator.

from openai import OpenAI
from deepeval.tracing import observe

client = OpenAI()

@observe()
def get_res(query: str):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message.content

    return response

The above get_res() component is treated as an individual span within a trace.

Add test cases inside components

You can assign individual test cases to a span by using the update_current_span function from deepeval. This allows you to create separate LLMTestCases on a component level.

from openai import OpenAI
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase

client = OpenAI()

@observe()
def get_res(query: str):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message.content

    update_current_span(input=query, output=response)
    return response

You can supply either a full LLMTestCase or its individual parameters to update_current_span to create a component-level test case. Learn more here.
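
As a quick sketch, both of the following calls inside an @observe-decorated component would create a span-level test case (query and response stand in for your own variables):

from deepeval.test_case import LLMTestCase
from deepeval.tracing import update_current_span

# Option 1: supply a full LLMTestCase
update_current_span(test_case=LLMTestCase(input=query, actual_output=response))

# Option 2: supply the test case parameters directly
update_current_span(input=query, output=response)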

Get your traces

You can now get your traces by simply calling your observed function or application.

query = "This will get you a trace."

get_res(query)

🎉🥳 Congratulations! You just created your first trace with deepeval.

tip

We highly recommend setting up Confident AI to view your traces in an intuitive UI.

Learn how to set up LLM tracing for Confident AI

It's free to get started. Just run the following command:

deepeval login

Observe

The @observe decorator is a non-intrusive Python decorator that you can use on top of any component you wish. It tracks each invocation of the component and creates a span for it.

A span can contain many child spans, forming a tree structure—just like how the different components of your LLM application interact.

from deepeval.tracing import observe

@observe()
def generate(query: str) -> str:
    context = retrieve(query)
    # Your implementation
    return f"Output for given {query} and {context}."

@observe()
def retrieve(query: str) -> list[str]:
    # Your implementation
    return [f"Context for the given {query}"]

In the above example, the observed component generate calls another observed component retrieve, creating a nested generate span with a retrieve span inside it.

There are FOUR optional parameters when using the @observe decorator:

  • [Optional] metric_collection: The name of the metric collection you stored in the Confident AI platform.
  • [Optional] metrics: A list of metrics of type BaseMetric that will be used to evaluate your span.
  • [Optional] name: The function name or a string specifying how this span is displayed on Confident AI.
  • [Optional] type: A string specifying the type of span. The value can be any one of llm, retriever, tool, and agent. Any other value is treated as a custom span type.
Click here to learn more about span types

For simplicity, we recommend custom spans unless you specifically need typed ones, since metrics only care about the scope of the span, and supplying a specific type is mainly useful when using Confident AI. To summarize:

  • Specifying a span type (like "llm") allows you to supply additional parameters in the @observe signature (e.g., the model used).
  • This information becomes extremely useful for analysis and visualization if you're using deepeval together with Confident AI (highly recommended).
  • Otherwise, for local evaluation purposes, span type makes no difference — evaluation still works the same way.
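
For example, here's a minimal sketch of a typed span, assuming (as described above) that the "llm" span type accepts the model used as an extra argument in the @observe signature; call_llm is a hypothetical component:

from deepeval.tracing import observe

# Hypothetical "llm"-typed span; the model argument is assumed to be
# surfaced on Confident AI for analysis and visualization
@observe(type="llm", model="gpt-4o")
def call_llm(prompt: str) -> str:
    # Your implementation
    return f"Answer for {prompt}"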

To learn more about the different span types, or to run LLM evaluations with tracing with a UI for visualization and debugging, visit the official Confident AI docs on LLM tracing.

Update Current Span

The update_current_span method can be used to create a test case for the corresponding span. This is especially useful for component-level evals or for debugging your application.

from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase

@observe()
def generate(query: str) -> str:
    context = retrieve(query)
    # Your implementation
    res = f"Output for given {query} and {context}."
    update_current_span(test_case=LLMTestCase(
        input=query,
        actual_output=res,
        retrieval_context=context
    ))
    return res

@observe()
def retrieve(query: str) -> list[str]:
    # Your implementation
    context = [f"Context for the given {query}"]
    update_current_span(input=query, retrieval_context=context)
    return context

There are TWO ways to create test cases when using the update_current_span function:

  • [Optional] test_case: Takes an LLMTestCase to create a span level test case for that component.

  • Or, you can opt to give the values of the LLMTestCase directly using the following attributes (see the sketch after this list):

    • [Optional] input
    • [Optional] output
    • [Optional] retrieval_context
    • [Optional] context
    • [Optional] expected_output
    • [Optional] tools_called
    • [Optional] expected_tools
    • [Optional] metadata
    • [Optional] name
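
As a sketch, a call supplying several of these attributes directly might look like the following (the values and the ToolCall are purely illustrative):

from deepeval.test_case import ToolCall
from deepeval.tracing import update_current_span

# Inside an @observe-decorated component
update_current_span(
    input="What's the weather in Paris?",
    output="It's sunny in Paris.",
    expected_output="Sunny skies in Paris.",
    tools_called=[ToolCall(name="get_weather")],
)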

Verbose Logs

If you run your @observe decorated LLM application outside of evaluate() or assert_test(), you'll notice some logs appearing in your console. These are debug logs meant to help during development, especially if you're using Confident AI alongside deepeval, to confirm that your LLM tracing is set up correctly.

If you're not using Confident AI, you can safely ignore these logs — they won't affect performance, introduce latency, or block any processes. To disable them completely, just set the following environment variables:

CONFIDENT_TRACE_VERBOSE="NO"
CONFIDENT_TRACE_FLUSH="NO"
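
These can live in your shell or .env file; as a sketch, you could also set them via os.environ, assuming this runs before deepeval starts tracing:

import os

# Disable deepeval's trace verbose logs and flush notices
os.environ["CONFIDENT_TRACE_VERBOSE"] = "NO"
os.environ["CONFIDENT_TRACE_FLUSH"] = "NO"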

If you are using Confident AI, it's still a good idea to keep these logs on during development, and then disable them in production once you've confirmed that tracing is working as expected.

Next Steps

Now that you have your traces, you can run either end-to-end or component-level evals.