LLM Tracing

Tracing your LLM application helps you monitor its full execution from start to finish. With deepeval's @observe decorator, you can trace and evaluate any LLM interaction at any point in your app, no matter how complex it is.

Quick Summary

An LLM trace is made up of one or more individual spans. A span is a flexible, user-defined scope for evaluation or debugging.

The most important thing to understand is how traces and spans map to evaluation in deepeval:

  • A trace is the LLMTestCase for end-to-end evals — its input, actual_output, retrieval_context, tools_called, and expected_output describe the whole run of your LLM app.
  • A span is the LLMTestCase for component-level evals — the same parameters apply, but they describe what happened inside that one component (a retriever, a tool, an LLM call, an agent step).

This means you don't need a separate concept to evaluate traces. The primitives (LLMTestCase, metrics, goldens) you already use for unit-style evals all work on traces and spans too — you just attach them via update_current_trace and update_current_span.

Learn how deepeval's tracing is non-intrusive

deepeval's tracing is non-intrusive: it requires minimal code changes and doesn't add latency to your LLM application. It also:

  • Uses concepts you already know: Tracing a component in your LLM app takes about 3 lines of code on average and uses the same LLMTestCases and metrics that you're already familiar with.

  • Does not affect production code: If you're worried that tracing will affect your LLM calls in production, it won't. The @observe decorators that you add for tracing are only invoked if called explicitly during evaluation.

  • Non-opinionated: deepeval does not care what you consider a "component" - in fact a component can be anything, at any scope, as long as you're able to set your LLMTestCase within that scope for evaluation.

Tracing only runs when you want it to run, and takes 3 lines of code:

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import observe, update_current_span
from openai import OpenAI

client = OpenAI()

@observe(metrics=[AnswerRelevancyMetric()])
def get_res(query: str):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message.content

    update_current_span(input=query, output=response)
    return response

Why Tracing?

Tracing turns the local eval loop (run the agent, inspect the trace, identify the failing span, patch the prompt or code, run the eval again) into something both you and a coding agent can drive without any context switch.

Concretely, tracing your LLM application lets you:

  • Generate test cases dynamically: Many components rely on upstream outputs. Tracing lets you define LLMTestCases at runtime as data flows through the system.

  • Debug with precision: See exactly where and why things fail — whether it's tool calls, intermediate outputs, or context retrieval steps.

  • Run targeted metrics on specific components: Attach LLMTestCases to agents, tools, retrievers, or LLMs and apply metrics like answer relevancy or context precision — without needing to restructure your app.

  • Run end-to-end evals with trace data: Use the evals_iterator with metrics to perform comprehensive evaluations using your traces.

Set Up Your First Trace

To set up tracing in your LLM app, you need to understand two key concepts:

  • Trace: The full execution of your app, made up of one or more spans.
  • Span: A specific component or unit of work—like an LLM call, tool invocation, or document retrieval.

You should log in to see traces for free on Confident AI:

deepeval login

Finally, pick how you want to instrument your app. deepeval also offers first-class integrations for popular agent frameworks where deepeval produces traces with zero or one line of setup.

Wrap any function in your LLM app with @observe — each call becomes a span, and the outermost call becomes the trace. Spans nest naturally as @observe'd functions call each other.

main.py
from openai import OpenAI
from deepeval.tracing import observe

@observe()
def retriever(query: str) -> list[str]:
    # Your retrieval logic
    return [f"Context for the given {query}"]

@observe()
def llm_app(query: str) -> str:
    context = retriever(query)
    return OpenAI().chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"{query}\n\n{context}"}],
    ).choices[0].message.content

llm_app("Who founded DeepEval?")

@observe accepts a few optional parameters:

  • [Optional] metrics: a list of BaseMetrics to attach for component-level evals.
  • [Optional] name: how this span is displayed in the trace tree (defaults to the function name).
  • [Optional] type: classifies the span — see Classify spans by type.
  • [Optional] metric_collection: name of a metric collection you stored on Confident AI.
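The nesting behaviour described above (each call becomes a span, the outermost call becomes the trace) can be illustrated with a small toy model. This is a sketch only and is not deepeval's implementation: toy_observe, finished_traces, and the dict-based span layout are hypothetical names invented for this illustration.

```python
import contextvars
import functools

# Toy stand-in for a tracer. NOT deepeval's implementation; all names here are hypothetical.
_current_span = contextvars.ContextVar("current_span", default=None)
finished_traces = []  # root spans of completed traces


def toy_observe(name=None):
    """Hypothetical decorator: every call to the wrapped function becomes a span dict."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            parent = _current_span.get()
            span = {"name": name or func.__name__, "children": []}
            token = _current_span.set(span)  # this span is now the "current" one
            try:
                return func(*args, **kwargs)
            finally:
                _current_span.reset(token)  # restore the parent as current
                if parent is None:
                    finished_traces.append(span)  # outermost call: this span is the trace root
                else:
                    parent["children"].append(span)  # nested call: attach to the parent span
        return wrapper
    return decorator


@toy_observe()
def retriever(query):
    return [f"Context for the given {query}"]


@toy_observe(name="app")
def llm_app(query):
    return " ".join(retriever(query))


llm_app("Who founded DeepEval?")
```

After the call, finished_traces holds a single root span named "app" with one child span named "retriever", which mirrors how the trace tree follows the call graph of decorated functions.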

Build your agent with create_agent and pass deepeval's CallbackHandler to its invoke method.

langchain_agent.py
from langchain.agents import create_agent
from deepeval.integrations.langchain import CallbackHandler

def multiply(a: int, b: int) -> int:
    """Multiply two numbers."""
    return a * b

agent = create_agent(
    model="openai:gpt-4o-mini",
    tools=[multiply],
    system_prompt="Be concise.",
)

agent.invoke(
    {"messages": [{"role": "user", "content": "What is 3 * 12?"}]},
    config={"callbacks": [CallbackHandler()]},
)

See the LangChain integration for the full surface.

Wire your StateGraph (LangGraph's core abstraction) and pass deepeval's CallbackHandler to its invoke method.

langgraph_agent.py
from langchain.chat_models import init_chat_model
from langgraph.graph import StateGraph, MessagesState, START, END
from deepeval.integrations.langchain import CallbackHandler

llm = init_chat_model("openai:gpt-4o-mini")

def chatbot(state: MessagesState):
    return {"messages": [llm.invoke(state["messages"])]}

graph = (
    StateGraph(MessagesState)
    .add_node(chatbot)
    .add_edge(START, "chatbot")
    .add_edge("chatbot", END)
    .compile()
)

graph.invoke(
    {"messages": [{"role": "user", "content": "What is 3 * 12?"}]},
    config={"callbacks": [CallbackHandler()]},
)

See the LangGraph integration for the full surface.

Drop-in replace from openai import OpenAI with from deepeval.openai import OpenAI. Every chat.completions.create(...), chat.completions.parse(...), and responses.create(...) call becomes an LLM span automatically.

openai_app.py
from deepeval.openai import OpenAI

client = OpenAI()
client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)

See the OpenAI integration for the full surface (including async, streaming, and tool-calling).

Pass DeepEvalInstrumentationSettings() to your Agent's instrument keyword.

pydanticai.py
from pydantic_ai import Agent
from deepeval.integrations.pydantic_ai import DeepEvalInstrumentationSettings

agent = Agent(
    "openai:gpt-4.1",
    system_prompt="Be concise.",
    instrument=DeepEvalInstrumentationSettings(),
)

agent.run_sync("Greetings, AI Agent.")

See the Pydantic AI integration for the full surface.

Call instrument_agentcore() before creating your AgentCore app. The same call also instruments Strands agents running inside AgentCore.

agentcore_agent.py
from bedrock_agentcore import BedrockAgentCoreApp
from strands import Agent
from deepeval.integrations.agentcore import instrument_agentcore

instrument_agentcore()

app = BedrockAgentCoreApp()
agent = Agent(model="amazon.nova-lite-v1:0")

@app.entrypoint
def invoke(payload, context):
    return {"result": str(agent(payload.get("prompt")))}

See the AgentCore integration for the full surface (including Strands-specific spans).

Call instrument_strands() before creating or invoking your Strands agent. Use this when you run Strands directly (scripts, services, notebooks); if your outer boundary is the AgentCore app entrypoint, use the AgentCore integration above instead.

strands_agent.py
from strands import Agent
from strands.models.openai import OpenAIModel

from deepeval.integrations.strands import instrument_strands

instrument_strands()

agent = Agent(
    model=OpenAIModel(model_id="gpt-4o-mini"),
    system_prompt="You are a helpful assistant.",
)

agent("Help me return my order.")

See the Strands integration for the full surface.

Drop-in replace from anthropic import Anthropic with from deepeval.anthropic import Anthropic. Every messages.create(...) call becomes an LLM span automatically.

anthropic_app.py
from deepeval.anthropic import Anthropic

client = Anthropic()
client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}],
)

See the Anthropic integration for the full surface (including async, streaming, and tool-use).

Register deepeval's event handler against LlamaIndex's instrumentation dispatcher.

llamaindex.py
import asyncio
from llama_index.llms.openai import OpenAI
from llama_index.core.agent import FunctionAgent
import llama_index.core.instrumentation as instrument

from deepeval.integrations.llama_index import instrument_llama_index

instrument_llama_index(instrument.get_dispatcher())

def multiply(a: float, b: float) -> float:
    return a * b

agent = FunctionAgent(
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful calculator.",
)

asyncio.run(agent.run("What is 8 multiplied by 6?"))

See the LlamaIndex integration for the full surface.

Register DeepEvalTracingProcessor once, then build your agent with deepeval's Agent and function_tool shims.

openai_agents.py
from agents import Runner, add_trace_processor
from deepeval.openai_agents import Agent, DeepEvalTracingProcessor, function_tool

add_trace_processor(DeepEvalTracingProcessor())

@function_tool
def get_weather(city: str) -> str:
    return f"It's always sunny in {city}!"

agent = Agent(
    name="weather_agent",
    instructions="Answer weather questions concisely.",
    tools=[get_weather],
)

Runner.run_sync(agent, "What's the weather in Paris?")

See the OpenAI Agents integration for the full surface.

Call instrument_google_adk() once before building your LlmAgent.

google_adk.py
import asyncio
from google.adk.agents import LlmAgent
from google.adk.runners import InMemoryRunner
from google.genai import types

from deepeval.integrations.google_adk import instrument_google_adk

instrument_google_adk()

agent = LlmAgent(model="gemini-2.0-flash", name="assistant", instruction="Be concise.")
runner = InMemoryRunner(agent=agent, app_name="deepeval-quickstart")

See the Google ADK integration for the full surface.

Call instrument_crewai() once, then build your crew with deepeval's Crew, Agent, and @tool shims.

crewai.py
from crewai import Task
from deepeval.integrations.crewai import instrument_crewai, Crew, Agent

instrument_crewai()

coder = Agent(
    role="Consultant",
    goal="Write a clear, concise explanation.",
    backstory="An expert consultant with a keen eye for software trends.",
)

task = Task(
    description="Explain the latest trends in AI.",
    agent=coder,
    expected_output="A clear and concise explanation.",
)

crew = Crew(agents=[coder], tasks=[task])
crew.kickoff()

See the CrewAI integration for the full surface.

🎉🥳 Congratulations! Calling your instrumented app now produces a trace. The rest of this page covers what to do with it — attaching test cases, classifying spans by type, and adding metadata.

Set test cases on traces and spans

This is the most important concept on this page: traces and spans both map to LLMTestCases, just at different scopes.

  • Trace = end-to-end LLMTestCase — what the user asked, what your app finally answered, what context was retrieved overall, what tools ended up being called. Used for end-to-end evals. Set with update_current_trace.
  • Span = component-level LLMTestCase — the same parameters, but scoped to what happened inside that one component (a retriever, a tool, a single LLM call). Used for component-level evals. Set with update_current_span.

Both functions accept the same LLMTestCase parameters, and both can be called from anywhere inside your @observe'd code. A typical pattern is to set span-level test cases inside the components you want to grade individually, and let trace-level data accumulate from those same spans:

main.py
from openai import OpenAI
from deepeval.tracing import observe, update_current_trace, update_current_span

@observe()
def retriever(query: str) -> list[str]:
    chunks = ["List", "of", "text", "chunks"]
    update_current_span(input=query, retrieval_context=chunks)   # span test case
    update_current_trace(retrieval_context=chunks)               # contributes to trace test case
    return chunks

@observe()
def llm_app(query: str) -> str:
    chunks = retriever(query)
    res = OpenAI().chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": f"{query}\n\n{chunks}"}],
    ).choices[0].message.content

    update_current_span(input=query, output=res)       # span test case
    update_current_trace(input=query, output=res)      # finishes trace test case
    return res

You can call either function multiple times from different spans — values are merged across calls, with later calls overriding earlier ones.

This is what lets the trace-level test case build up incrementally as data flows through your app: a retriever span contributes retrieval_context, a generator span contributes output, and you end up with a complete LLMTestCase on the trace by the time the run finishes.
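The merge rule above ("later calls override earlier ones") can be pictured with plain dictionaries. The sketch below is a toy model, not deepeval's code: merge_updates is a hypothetical helper, and skipping None values is an assumption made for the illustration rather than documented behaviour.

```python
def merge_updates(*calls):
    """Toy model: merge keyword updates in call order; later values win."""
    merged = {}
    for kwargs in calls:
        # Assumption: a field left as None does not erase an earlier value.
        merged.update({k: v for k, v in kwargs.items() if v is not None})
    return merged


trace_test_case = merge_updates(
    {"retrieval_context": ["List", "of", "text", "chunks"]},  # set from the retriever span
    {"input": "draft question", "output": None},              # an early call
    {"input": "final question", "output": "final answer"},    # a later call overrides input
)
```

Here trace_test_case ends up with the retrieval_context from the first call and the input and output from the last one, which is the incremental build-up the paragraph describes.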

Map test case parameters to traces and spans

Both update_current_trace and update_current_span accept the same set of LLMTestCase parameters, fanned out as keyword arguments. The names line up one-to-one with LLMTestCase — the only one that's been renamed is actual_output, which becomes plain output on a trace/span (it's still the same field, just shorter):

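| LLMTestCase parameter | update_current_trace / update_current_span |
|-----------------------|--------------------------------------------|
| input                 | input                                      |
| actual_output         | output                                     |
| expected_output       | expected_output                            |
| retrieval_context     | retrieval_context                          |
| context               | context                                    |
| tools_called          | tools_called                               |
| expected_tools        | expected_tools                             |
| tags                  | tags (trace only)                          |
| metadata              | metadata                                   |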
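If you already hold LLMTestCase-style fields in a dictionary, the mapping in the table above can be applied mechanically. The helper below is a hypothetical convenience written for this page (to_update_kwargs is not part of deepeval); it only encodes the rename of actual_output to output and the "trace only" note on tags.

```python
# Field names are taken from the table above; only actual_output is renamed.
RENAMED = {"actual_output": "output"}
SHARED = {
    "input", "expected_output", "retrieval_context", "context",
    "tools_called", "expected_tools", "tags", "metadata",
}


def to_update_kwargs(test_case_fields, target="span"):
    """Hypothetical helper: translate LLMTestCase field names into update_* keyword names."""
    kwargs = {}
    for field, value in test_case_fields.items():
        if field == "tags" and target != "trace":
            continue  # the table marks tags as trace only
        if field in RENAMED:
            kwargs[RENAMED[field]] = value
        elif field in SHARED:
            kwargs[field] = value
        else:
            raise KeyError(f"not listed in the mapping table: {field}")
    return kwargs


fields = {"input": "q", "actual_output": "a", "tags": ["support"]}
span_kwargs = to_update_kwargs(fields, target="span")
trace_kwargs = to_update_kwargs(fields, target="trace")
```

The resulting dictionaries could then be unpacked into update_current_span(**span_kwargs) or update_current_trace(**trace_kwargs) inside an @observe'd function.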

Prettifying traces for coding agents

Traces aren't only read by humans. When you run evals locally and a metric fails, the failing trace is also what coding agents like Claude Code, Codex, and Cursor load into context to figure out which prompt, retriever, or tool actually caused the regression.

The more self-describing the trace tree is, the less the agent has to guess from function names — and the faster it can propose a real fix instead of a generic one.

Trace name

By default, a trace has no name. Set one at runtime with update_current_trace(name=...) so the failing run reads as "Customer support flow failed at retriever" rather than "llm_app failed at retrieve":

from deepeval.tracing import observe, update_current_trace

@observe()
def llm_app(query: str):
    update_current_trace(name="Customer support flow")
    # ...

Span names default to the function name they decorate, which is usually descriptive enough — but you can override with update_current_span(name=...) whenever the function name doesn't reflect what the span actually does.

Span types

The type parameter on @observe is a label, not an eval input. It does not affect scoring — metrics only care about the scope of the span. What it does is turn the trace tree from a generic call graph into a typed one, so a coding agent reading "this retriever span returned 0 chunks for input X" gets there immediately without having to infer roles from function names.

There are four built-in types plus a custom fallback. Each type accepts a few type-specific kwargs:

| type                    | Purpose                               | Type-specific kwargs |
|-------------------------|---------------------------------------|----------------------|
| "llm"                   | A call to a language model            | model, cost_per_input_token, cost_per_output_token (decorator); input_token_count, output_token_count via update_llm_span |
| "retriever"             | Fetches chunks from a vector store    | embedder (decorator); top_k, chunk_size via update_retriever_span |
| "tool"                  | A function the LLM/agent invokes      | description |
| "agent"                 | An autonomous decision-making step    | available_tools, handoff_agents |
| anything else (default) | Custom (grouping or general-purpose)  | (none) |

from deepeval.tracing import observe

@observe(type="retriever", embedder="text-embedding-3-small")
def retrieve(query: str) -> list[str]: ...

@observe(type="llm", model="gpt-4o")
def generate(prompt: str) -> str: ...

@observe(type="tool", description="Search the web for a query.")
def web_search(query: str) -> str: ...

@observe(type="agent", available_tools=["search", "calculator"])
def supervisor_agent(query: str) -> str: ...
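The "llm" kwargs pair per-token prices with token counts, which suggests how a span's cost could be derived. The arithmetic below is a sketch under the assumption that cost is a simple linear sum; how deepeval or Confident AI actually computes cost is not stated on this page, and the prices are placeholder values rather than real model pricing.

```python
def llm_span_cost(input_token_count, output_token_count,
                  cost_per_input_token, cost_per_output_token):
    """Assumed linear cost model: tokens times per-token price, summed over both directions."""
    return (input_token_count * cost_per_input_token
            + output_token_count * cost_per_output_token)


# Placeholder numbers for illustration only.
cost = llm_span_cost(
    input_token_count=1200,
    output_token_count=300,
    cost_per_input_token=2.5e-06,
    cost_per_output_token=1e-05,
)
```

With these placeholder values, the input and output sides each contribute about 0.003, for a total of roughly 0.006 in whatever currency the per-token prices use.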

Reference goldens at runtime

In deepeval, a golden is the reference test case used by your metrics, for example, to compare actual and expected outputs. During evaluation, you can read the active golden and pass its expected_output to spans or traces:

from deepeval.dataset import get_current_golden
from deepeval.tracing import observe, update_current_span, update_current_trace

@observe()
def tool(input: str):
    result = ...  # produce your model or tool output

    golden = get_current_golden()                  # active golden for this test
    expected = golden.expected_output if golden else None

    # set on the span (component-level)
    update_current_span(input=input, output=result, expected_output=expected)

    # or set on the trace (end-to-end)
    update_current_trace(input=input, output=result, expected_output=expected)
    return result

If you don't want to use the dataset's expected_output, pass your own string instead.


Environment Variables

If you run your @observe decorated LLM application outside of evaluate() or assert_test(), you'll notice some logs appearing in your console. To disable them completely, just set the following environment variables:

CONFIDENT_TRACE_VERBOSE=0
CONFIDENT_TRACE_FLUSH=0
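If you prefer to set these from Python rather than your shell, a minimal sketch is shown below. It assumes the variables are read when your @observe-decorated code is imported or run, so the assignments should be placed at the very top of your entry point; that timing is an assumption, not something this page confirms.

```python
import os

# Set these before importing or running the @observe-decorated app
# (assumption: the values are read at import time or at run time).
os.environ["CONFIDENT_TRACE_VERBOSE"] = "0"
os.environ["CONFIDENT_TRACE_FLUSH"] = "0"
```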

Visualize and Monitor on Confident AI

Everything above runs entirely locally — you don't need an account for any of it. But once your traces start carrying real data (test cases, span types, tags, metadata, token costs), reading them in a terminal stops scaling.

Confident AI is the official platform for deepeval and renders the exact same trace data you're already producing into a UI:

Learn how to set up LLM tracing for Confident AI

You get this with zero additional code — just log in:

deepeval login

Once logged in, the same @observe-decorated app will also stream traces in real-time, let you run online evaluations on production traffic, log prompt versions on LLM spans, and visualize token costs across runs.

Next Steps

Now that you have your traces, you can run either end-to-end or component-level evals.
