LangChain

LangChain is an open-source framework for building LLM applications with models, prompts, tools, retrievers, and agents (via create_agent).

The deepeval integration traces LangChain runs through a CallbackHandler that you pass into LangChain's config. Every agent run, model call, tool call, and retriever call becomes a span you can inspect, without rewriting your LangChain app.

deepeval's LangChain integration enables you to:

  • Trace any LangChain run: pass CallbackHandler(...) through config={"callbacks": [...]} per call.
  • Evaluate traces or individual components with deepeval metrics.
  • Run evals from scripts or CI/CD: same callback, different surfaces.
  • Customize trace and span data through callback kwargs, LangChain metadata, and deepeval's tool decorator.

Getting Started

Installation

pip install -U deepeval langchain langchain-openai

LangChain is instrumented per-call: you decide which runs are traced by passing CallbackHandler(...) into LangChain's runtime config.

Instrument and evaluate

Create a CallbackHandler and pass it to the agent's invoke method.

langchain_agent.py
from langchain.agents import create_agent
from deepeval.integrations.langchain import CallbackHandler
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

def multiply(a: int, b: int) -> int:
    """Multiply two numbers."""
    return a * b

agent = create_agent(
    model="openai:gpt-4o-mini",
    tools=[multiply],
    system_prompt="Be concise.",
)

# Goldens are the inputs you want to evaluate.
dataset = EvaluationDataset(goldens=[Golden(input="What is 8 multiplied by 6?")])

# The `TaskCompletionMetric` is passed into the LangChain callback.
for golden in dataset.evals_iterator():
    agent.invoke(
        {"messages": [{"role": "user", "content": golden.input}]},
        config={"callbacks": [CallbackHandler(metrics=[TaskCompletionMetric()])]},
    )

Done ✅. You've run your first eval with full traceability into LangChain via deepeval.

What gets traced

Each LangChain call that receives a CallbackHandler produces a trace, the end-to-end unit your user observes. Inside that trace are component spans for each callback LangChain emits:

  • Agent spans: create_agent(...) runs and any nested runnable steps.
  • LLM spans: chat model and completion calls.
  • Tool spans: tool calls and function executions.
  • Retriever spans: retriever calls, when your app uses retrieval.

Trace                           ← what the user observes
└── Agent: math_agent           ← one create_agent invoke(...) call
    ├── LLM: gpt-4o-mini        ← component span: model chooses a tool
    ├── Tool: multiply          ← component span: tool input + output
    └── LLM: gpt-4o-mini        ← component span: final answer

The trace and its component spans are independently evaluable.
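
To make the shape of a trace concrete, here is a minimal sketch that models the diagram above as a plain tree. The Span class and its fields are hypothetical stand-ins for illustration, not deepeval's actual data model; the point is only that a metric can sit on the trace as a whole or on any single span inside it.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Span:
    """Hypothetical stand-in for a component span (not deepeval's data model)."""
    kind: str  # "agent", "llm", "tool", or "retriever"
    name: str
    metrics: List[str] = field(default_factory=list)
    children: List["Span"] = field(default_factory=list)

    def walk(self) -> Iterator["Span"]:
        """Yield this span and every descendant, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


# Mirror the diagram: one agent span wrapping LLM -> tool -> LLM.
trace_root = Span("agent", "math_agent", children=[
    Span("llm", "gpt-4o-mini"),
    Span("tool", "multiply"),
    Span("llm", "gpt-4o-mini"),
])

# Metrics are attached independently: one for the whole run, one for a single span.
trace_metrics = ["TaskCompletionMetric"]
llm_spans = [span for span in trace_root.walk() if span.kind == "llm"]
llm_spans[0].metrics.append("AnswerRelevancyMetric")

span_kinds = [span.kind for span in trace_root.walk()]
```

Here only the first LLM span carries a component metric, while the trace-level metric list is kept separately from the span tree.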

Running evals

There are two surfaces for running evals against a LangChain app. Pick by where you want results to surface: your terminal during development, or your CI pipeline as a pass/fail gate.

In CI/CD (pytest)

Use the deepeval pytest integration. Each parametrized test invocation becomes one LangChain run; failing metrics fail the test, which fails the build.

test_langchain_agent.py
import pytest
from langchain.agents import create_agent
from deepeval import assert_test
from deepeval.integrations.langchain import CallbackHandler
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

def multiply(a: int, b: int) -> int:
    """Multiply two numbers."""
    return a * b

agent = create_agent(model="openai:gpt-4o-mini", tools=[multiply], system_prompt="Be concise.")
dataset = EvaluationDataset(goldens=[
    Golden(input="What is 8 multiplied by 6?"),
    Golden(input="What is 7 multiplied by 9?"),
])

@pytest.mark.parametrize("golden", dataset.goldens)
def test_langchain_agent(golden: Golden):
    agent.invoke(
        {"messages": [{"role": "user", "content": golden.input}]},
        config={"callbacks": [CallbackHandler()]},
    )
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])

Run it with:

deepeval test run test_langchain_agent.py
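
The pass/fail gate can be pictured with a toy model. The sketch below is not deepeval's implementation: ToyMetricResult and toy_assert are hypothetical names, and the rule that a metric passes when its score is at or above its threshold is an assumption made for illustration. It shows only why one failing metric is enough to fail a test, and therefore the build.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyMetricResult:
    """Hypothetical metric outcome; assumes pass means score >= threshold."""
    name: str
    score: float
    threshold: float = 0.5

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold


def toy_assert(results: List[ToyMetricResult]) -> None:
    """Raise AssertionError if any metric failed, which is what pytest reports as a failed test."""
    failed = [result.name for result in results if not result.passed]
    if failed:
        raise AssertionError("Metrics failed: " + ", ".join(failed))


# A passing run returns silently; a failing run raises and fails the test.
toy_assert([ToyMetricResult("TaskCompletion", score=0.9)])
```

Because the failure surfaces as an ordinary AssertionError, any CI system that already runs pytest treats it like any other failed test.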

In a script

Use EvaluationDataset + evals_iterator(...). Each Golden becomes one LangChain run; metrics score the resulting trace through the callback.

langchain_agent.py
dataset = EvaluationDataset(goldens=[
    Golden(input="What is 8 multiplied by 6?"),
    Golden(input="What is 7 multiplied by 9?"),
])

for golden in dataset.evals_iterator():
    agent.invoke(
        {"messages": [{"role": "user", "content": golden.input}]},
        config={"callbacks": [CallbackHandler(metrics=[TaskCompletionMetric()])]},
    )

Applying metrics to components

Passing metrics=[...] to CallbackHandler evaluates the overall LangChain run. To evaluate a component instead, attach metrics where LangChain creates that component.

LLM calls

Wrap the invocation in a with next_llm_span(metrics=[...]) block. The CallbackHandler drains the staged metrics onto the first LLM span it opens inside the with block; later LLM calls in the same run get nothing. These are the same one-shot semantics used by next_*_span in the Pydantic AI / Strands / AgentCore / Google ADK integrations.

langchain_agent.py
from langchain.agents import create_agent
from deepeval.integrations.langchain import CallbackHandler
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import next_llm_span
...

agent = create_agent(model="openai:gpt-4o-mini", tools=[multiply], system_prompt="Be concise.")

for golden in dataset.evals_iterator():
    with next_llm_span(metrics=[AnswerRelevancyMetric()]):
        agent.invoke(
            {"messages": [{"role": "user", "content": golden.input}]},
            config={"callbacks": [CallbackHandler()]},
        )

For deterministic tool calls, use tool spans for traceability, inputs, outputs, and metadata. Avoid attaching metrics directly to tool spans.

Retriever calls

Wrap the invocation in a with next_retriever_span(...) block to stage a metric (or a Confident AI metric_collection) on the first retriever span LangChain opens inside the with block.

langchain_agent.py
from deepeval.integrations.langchain import CallbackHandler
from deepeval.tracing import next_retriever_span
...

for golden in dataset.evals_iterator():
    with next_retriever_span(metric_collection="retriever_v1"):
        chain.invoke(
            {"messages": [{"role": "user", "content": golden.input}]},
            config={"callbacks": [CallbackHandler()]},
        )

next_retriever_span accepts the same metrics=[...] / metric_collection=... kwargs as next_llm_span. The same one-shot semantics apply: only the first retriever span in the run picks up the staged config.

Customizing trace and span data

LangChain is instrumented per-call through callbacks, so customization happens at the callback or span-staging boundary.

  • Use CallbackHandler(...) kwargs for trace-level defaults like name, tags, metadata, thread_id, and user_id.
  • Use next_llm_span(...) / next_retriever_span(...) / next_tool_span(...) to stage component-level fields (metrics, metric collections, test cases, custom span metadata) onto the next span the callback opens.
  • Use tool spans for deterministic traceability, inputs, outputs, and metadata.

langchain_agent.py
callback = CallbackHandler(
    name="math-agent",
    tags=["langchain", "math"],
    metadata={"team": "support"},
    user_id="user-123",
)

agent.invoke(
    {"messages": [{"role": "user", "content": "What is 8 multiplied by 6?"}]},
    config={"callbacks": [callback]},
)

Advanced patterns

The primitives above (CallbackHandler(...), next_*_span(...), and deepeval's tool decorator) compose around one boundary: LangChain owns the callback lifecycle, and your code chooses where to stage component config for the next span the callback opens.

Evaluate subagents/components

Stage a component metric by wrapping the agent.invoke(...) call in a with next_llm_span(...) block. The CallbackHandler drains the staged metric onto the first LLM span LangChain opens inside the with block, so the metric lives on the LLM span inside the agent loop without modifying the agent or model.

langchain_agent.py
from langchain.agents import create_agent
from deepeval.integrations.langchain import CallbackHandler
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import next_llm_span
...

agent = create_agent(model="openai:gpt-4o-mini", tools=[multiply], system_prompt="Be concise.")

No trace-level metrics required

Trace-level metrics are end-to-end metrics: they score the whole trace. They are not strictly necessary here because the AnswerRelevancyMetric is staged for the LLM span, so CI/CD and scripts only need to run the agent inside the staging block.

This is how you'd run it:

test_langchain_agent.py
import pytest
from deepeval import assert_test
...

@pytest.mark.parametrize("golden", dataset.goldens)
def test_component_metrics(golden: Golden):
    with next_llm_span(metrics=[AnswerRelevancyMetric()]):
        agent.invoke(
            {"messages": [{"role": "user", "content": golden.input}]},
            config={"callbacks": [CallbackHandler()]},
        )
    assert_test(golden=golden)

Run it with:

deepeval test run test_langchain_agent.py

Or run the same staging block from a script:

langchain_agent.py
...

for golden in dataset.evals_iterator():
    with next_llm_span(metrics=[AnswerRelevancyMetric()]):
        agent.invoke(
            {"messages": [{"role": "user", "content": golden.input}]},
            config={"callbacks": [CallbackHandler()]},
        )

Wrap a LangChain run in @observe

When the LangChain call is part of a larger operation, decorate the outer function with @observe. LangChain spans nest under your observed span when the callback runs inside it.

langchain_agent.py
from deepeval.tracing import observe
...

@observe(name="respond_to_user")
def respond_to_user(prompt: str) -> str:
    result = agent.invoke(
        {"messages": [{"role": "user", "content": prompt}]},
        config={"callbacks": [CallbackHandler()]},
    )
    return result["messages"][-1].content

API reference

CallbackHandler(...) accepts the following trace-level kwargs. Each one is a default for runs that use that callback.

Kwarg              Type       Description
name               str        Default trace name.
tags               list[str]  Tags applied to traces produced by this callback.
metadata           dict       Trace metadata applied when the callback starts a trace.
thread_id          str        Groups related runs into a single trace thread.
user_id            str        Actor identifier for the trace.
metrics            list       Metrics applied to the LangChain run.
metric_collection  str        Metric collection applied to the LangChain run.
test_case_id       str        Optional test case identifier.
turn_id            str        Optional turn identifier for conversational traces.

For native tracing helpers (@observe, with trace(...), update_current_trace, update_current_span) see the tracing reference.
