Pydantic AI

Pydantic AI is a Python framework for building production-grade applications with Generative AI, with type safety and validation for agent outputs and LLM interactions.

The deepeval integration auto-instruments your Pydantic AI agents and traces every call to them. Every agent run, tool call, and LLM call becomes a span you can inspect, without wiring trace structure by hand.

deepeval's Pydantic AI integration enables you to:

  • Auto-instrument every Agent: each agent.run(...) produces a trace, and each LLM, tool, and sub-agent call inside it becomes a component span.
  • Evaluate the trace end-to-end or target model / agent components with any deepeval metric.
  • Run evals from a script (evals_iterator) or from CI/CD (pytest + deepeval test run): same metrics, two surfaces.
  • Customize trace and span data at runtime from anywhere in the call stack: your tool bodies, post-processors, or the call site.

Getting Started

Installation

pip install -U deepeval pydantic-ai opentelemetry-sdk opentelemetry-exporter-otlp-proto-http

Under the hood the integration plugs Pydantic AI's OpenTelemetry instrumentation into deepeval's span processor.

Instrument and evaluate

Pass DeepEvalInstrumentationSettings to the Agent's instrument keyword. From that point on, any agent.run(...), agent.run_sync(...), or agent.run_stream(...) call produces a trace deepeval can read.

pydantic_ai_agent.py
from pydantic_ai import Agent
from deepeval.integrations.pydantic_ai import DeepEvalInstrumentationSettings
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

agent = Agent(
    "openai:gpt-5",
    system_prompt="Be concise, reply with one sentence.",
    instrument=DeepEvalInstrumentationSettings(),
)

# Goldens are the inputs you want to evaluate.
dataset = EvaluationDataset(goldens=[Golden(input="What's the weather in Paris?")])

# `evals_iterator` loops through the goldens and applies the metrics.
for golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):
    agent.run_sync(golden.input)  # Produces a trace for evaluation

Done ✅. You've run your first eval with full traceability into Pydantic AI via deepeval.

What gets traced

Each agent.run(...) call produces a trace: the end-to-end unit your user observes, from the prompt going in to the final output coming out. Inside that trace are component spans for every step the agent took to produce the answer:

  • LLM spans: one per LLM call inside the run.
  • Tool spans: one per tool call.
  • Agent spans: nested for sub-agent calls (delegations, handoffs).

Sync, async, and streaming paths all flow through the same instrumentation, so there's nothing to configure differently between them.

Trace                           ← what the user observes (end-to-end)
└── Agent: assistant            ← one agent.run(...) call
    ├── LLM: openai:gpt-5       ← component span: model decides which tool to call
    ├── Tool: get_weather       ← component span: tool input + output
    └── LLM: openai:gpt-5       ← component span: model produces the final answer
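
To make the shape of that tree concrete, here is a minimal sketch of a trace as a tree of spans, using only the standard library. The `Span` class and the `iter_spans` helper are hypothetical illustrations, not deepeval's or Pydantic AI's data model.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Span:
    """Hypothetical stand-in for one component span (not deepeval's class)."""
    kind: str  # "agent", "llm", or "tool"
    name: str
    children: list["Span"] = field(default_factory=list)


def iter_spans(root: Span) -> Iterator[Span]:
    """Yield the root span and every descendant, depth-first."""
    yield root
    for child in root.children:
        yield from iter_spans(child)


# Mirror the diagram: one agent run containing LLM -> tool -> LLM.
trace_root = Span("agent", "assistant", [
    Span("llm", "openai:gpt-5"),
    Span("tool", "get_weather"),
    Span("llm", "openai:gpt-5"),
])

# Selecting spans by kind is what "targeting a component" amounts to in this toy model.
llm_spans = [s for s in iter_spans(trace_root) if s.kind == "llm"]
tool_spans = [s for s in iter_spans(trace_root) if s.kind == "tool"]
print(len(llm_spans), len(tool_spans))
```

In this toy model, a trace-level metric would look at the root's input and output, while a component-level metric would look at one of the selected spans.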

The trace and its component spans are independently evaluable. The next two sections describe how to run those evaluations.

Running evals

There are two surfaces for running evals against a Pydantic AI agent. Pick based on where you want results to surface: your terminal during a notebook session, or your CI pipeline as a pass/fail gate. Metric definitions are the same in both.

In CI/CD (pytest)

Use the deepeval pytest integration. Each parametrized test invocation becomes one agent run; failing metrics fail the test, which fails the build. This is the right surface for regression gates and pre-merge checks.

Define an EvaluationDataset at module scope, parametrize the test over its goldens, call the agent inside the test, and let assert_test evaluate the trace it just produced.

test_pydantic_ai_agent.py
import pytest

from pydantic_ai import Agent
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.integrations.pydantic_ai import DeepEvalInstrumentationSettings
from deepeval.metrics import AnswerRelevancyMetric

agent = Agent(
    "openai:gpt-5",
    system_prompt="Be concise, reply with one sentence.",
    instrument=DeepEvalInstrumentationSettings(name="my-agent"),
)

dataset = EvaluationDataset(
    goldens=[
        Golden(input="What's the weather in Paris?"),
        Golden(input="What's the weather in London?"),
    ]
)


@pytest.mark.parametrize("golden", dataset.goldens)
def test_agent(golden: Golden):
    agent.run_sync(golden.input)
    assert_test(golden=golden, metrics=[AnswerRelevancyMetric()])

Run it with:

deepeval test run test_pydantic_ai_agent.py

The same metrics you used in evals_iterator work unchanged here. The only difference is what surfaces the failures: a CI badge instead of a notebook cell.
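
The pass/fail gating described above can be sketched in isolation. This is a toy model, assuming a metric reduces to a score compared against a threshold; `ToyMetric` and `toy_assert` are hypothetical names, and real deepeval metrics compute their scores very differently.

```python
from dataclasses import dataclass


@dataclass
class ToyMetric:
    """Hypothetical metric: a fixed score compared against a threshold."""
    name: str
    score: float
    threshold: float = 0.5

    def is_successful(self) -> bool:
        return self.score >= self.threshold


def toy_assert(metrics: list[ToyMetric]) -> None:
    """Raise AssertionError naming every metric that fell below its threshold."""
    failed = [m.name for m in metrics if not m.is_successful()]
    if failed:
        raise AssertionError(f"Metrics failed: {', '.join(failed)}")


# A passing run returns silently.
toy_assert([ToyMetric("relevancy", score=0.9)])

# A failing run raises, which pytest would report as a failed test.
try:
    toy_assert([
        ToyMetric("relevancy", score=0.9),
        ToyMetric("task_completion", score=0.2),
    ])
except AssertionError as exc:
    failure_message = str(exc)
```

Because the failure is an ordinary AssertionError, any test runner that treats exceptions as failures can act as the build gate.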

In a script

Use EvaluationDataset + evals_iterator(...). Each Golden becomes one agent run; metrics score the resulting trace. This is the right surface for ad-hoc runs, notebooks, and one-off comparisons.

pydantic_ai_agent.py
import asyncio

from pydantic_ai import Agent
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.evaluate.configs import AsyncConfig
from deepeval.integrations.pydantic_ai import DeepEvalInstrumentationSettings
from deepeval.metrics import AnswerRelevancyMetric

agent = Agent(
    "openai:gpt-5",
    system_prompt="Be concise, reply with one sentence.",
    instrument=DeepEvalInstrumentationSettings(name="my-agent"),
)


dataset = EvaluationDataset(
    goldens=[
        Golden(input="What's the weather in Paris?"),
        Golden(input="What's the weather in London?"),
    ]
)
answer_relevancy = AnswerRelevancyMetric()

for golden in dataset.evals_iterator(
    async_config=AsyncConfig(run_async=True),
    metrics=[answer_relevancy],
):
    task = asyncio.create_task(agent.run(golden.input))
    dataset.evaluate(task)

evals_iterator is async-friendly; wrap each invocation in asyncio.create_task and pass it to dataset.evaluate(...) so multiple goldens run concurrently against the same dataset.
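
The underlying concurrency pattern can be shown with plain asyncio. The sketch below does not describe how deepeval drives the event loop; `fake_agent_run` is a hypothetical stand-in for agent.run(...), and no model is called.

```python
import asyncio

in_flight = 0
max_in_flight = 0


async def fake_agent_run(prompt: str) -> str:
    """Hypothetical stand-in for `agent.run(...)`; no model is called."""
    global in_flight, max_in_flight
    in_flight += 1
    max_in_flight = max(max_in_flight, in_flight)
    await asyncio.sleep(0.01)  # simulate waiting on a network call
    in_flight -= 1
    return f"answer to: {prompt}"


async def main(prompts: list[str]) -> list[str]:
    # create_task schedules each coroutine on the event loop without awaiting it,
    # so the runs overlap instead of executing one after another.
    tasks = [asyncio.create_task(fake_agent_run(p)) for p in prompts]
    # gather returns results in the same order as the tasks were passed in.
    return await asyncio.gather(*tasks)


prompts = ["What's the weather in Paris?", "What's the weather in London?"]
answers = asyncio.run(main(prompts))
```

After the run, `max_in_flight` equals the number of prompts, which shows that both simulated agent calls were waiting at the same time.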

Applying metrics to components

The metrics=[...] you passed to evals_iterator in the previous section evaluate the trace, that is, the end-to-end behavior the user observes. To evaluate a component instead (a specific LLM call or the agent span itself), stage the metric with the appropriate next_*_span(...) wrapper before the run.

LLM calls

Wrap the run in next_llm_span(metrics=[...]) to attach metrics to the LLM span. This is useful when you want to evaluate the LLM's reasoning step in isolation from the tool's effect.

pydantic_ai_agent.py
from deepeval.tracing import next_llm_span


async def run_agent(prompt: str):
    with next_llm_span(metrics=[answer_relevancy]):
        return await agent.run(prompt)


for golden in dataset.evals_iterator(async_config=AsyncConfig(run_async=True)):
    task = asyncio.create_task(run_agent(golden.input))
    dataset.evaluate(task)

Agent spans

next_agent_span(metrics=[...]) targets the agent component itself. The agent span shares its input and output with the trace, but it's a distinct unit; use this when you want a metric on the agent span specifically (rather than the trace).

pydantic_ai_agent.py
from deepeval.tracing import next_agent_span


async def run_agent(prompt: str):
    with next_agent_span(metrics=[answer_relevancy]):
        return await agent.run(prompt)


for golden in dataset.evals_iterator(async_config=AsyncConfig(run_async=True)):
    task = asyncio.create_task(run_agent(golden.input))
    dataset.evaluate(task)

For deterministic tool calls, prefer update_current_span(...) to add metadata, inputs, and outputs instead of attaching metrics to the tool span.

Customizing trace and span data at runtime

Trace-level fields you set on DeepEvalInstrumentationSettings are defaults; they apply to every trace produced by that agent. For anything dynamic, the right API depends on where your code runs.

Pydantic AI creates most of the trace structure for you, which means the agent, LLM, and tool spans are mostly hidden behind agent.run(...). Calls like update_current_trace(...) and update_current_span(...) only work while there is an active deepeval trace/span in context. In practice, that means a Pydantic AI tool body is your clearest mutation point, because Pydantic AI has already opened the trace and the tool span before your function runs.

If you need to customize from outside a tool, use DeepEvalInstrumentationSettings for static defaults, next_*_span(...) to stage config for the next span Pydantic AI creates, or @observe / with trace(...) when you own the outer operation. The advanced section below shows those scenarios.

Trace-level fields from inside a tool

update_current_trace(...) mutates the active trace. Use it when a tool discovers metadata you only know during the run, like a user id, request id, retrieved document id, or routing decision.

pydantic_ai_agent.py
from deepeval.tracing import update_current_trace
...

@agent.tool_plain
def fetch_user(user_id: str) -> dict:
    user = users_db.get(user_id)
    update_current_trace(
        user_id=user_id,
        metadata={"plan": user["plan"], "region": user["region"]},
    )
    return user

Span-level fields from inside a tool

update_current_span(...) writes to whichever span Pydantic AI just opened, typically the tool span if you call it from inside a tool body. This is useful for tagging tool-call metadata (cache hits, downstream IDs, retrieval context) without restructuring the tool.

pydantic_ai_agent.py
from deepeval.tracing import update_current_span
...

@agent.tool_plain
def get_weather(city: str) -> str:
    cache_hit, value = weather_cache.lookup(city)
    update_current_span(
        metadata={"cache_hit": cache_hit, "city": city},
        output=value,
    )
    return value

The general rule: settings hold defaults, next_*_span(...) stages changes before Pydantic AI opens the span, and update_current_*(...) mutates only after your code is already inside an active trace/span.

Advanced patterns

The primitives above (DeepEvalInstrumentationSettings, @observe, with trace(...), next_*_span(...), update_current_*(...)) compose around one boundary: Pydantic AI owns the auto-instrumented spans, and your code customizes them from the places it can actually see. Use @observe or with trace(...) when you own an outer workflow, next_*_span(...) when you want to configure a span before Pydantic AI creates it, and update_current_*(...) when a tool or observed function is already running inside the trace.

Evaluate subagents with next_*_span

next_*_span(metrics=[...]) stages a metric for the next matching Pydantic AI component span. Use this when you want to evaluate a subagent or model step instead of the full trace. Pick the helper that matches the span you want to score: next_agent_span(...) or next_llm_span(...).

pydantic_ai_agent.py
from deepeval.tracing import next_agent_span
...

async def run_agent(prompt: str):
    with next_agent_span(metrics=[answer_relevancy]):
        return await agent.run(prompt)

No trace-level metrics required

Trace-level metrics are end-to-end metrics: they score the whole trace. They are not strictly necessary here because the AnswerRelevancyMetric is attached to the next agent span, so CI/CD and scripts only need to run the subagent.

This is how you'd run it:

test_pydantic_ai_agent.py
import asyncio
import pytest
from deepeval import assert_test
...

@pytest.mark.parametrize("golden", dataset.goldens)
def test_agent_span(golden: Golden):
    asyncio.run(run_agent(golden.input))
    assert_test(golden=golden)

Run it with:

deepeval test run test_pydantic_ai_agent.py

Or run the same function from a script:

pydantic_ai_agent.py
...

for golden in dataset.evals_iterator(async_config=AsyncConfig(run_async=True)):
    task = asyncio.create_task(run_agent(golden.input))
    dataset.evaluate(task)

Wrap an agent run in @observe

When the agent run isn't your top-level unit of work (for example, a respond_to_user(...) function that calls the agent and post-processes the result), you can decorate that outer function with @observe. The Pydantic AI spans nest under your @observe span automatically; the result is a single trace rooted at your function with the agent run inside it.

pydantic_ai_agent.py
from deepeval.tracing import observe
...

@observe(name="respond_to_user")
async def respond_to_user(prompt: str) -> str:
    result = await agent.run(prompt)
    return result.output.strip().upper()
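
How a decorated outer function becomes the parent of spans opened inside it can be sketched with a context variable. `toy_observe` and `fake_agent_run` are hypothetical; the sketch only illustrates the nesting idea and makes no claim about how @observe is implemented.

```python
import asyncio
import functools
from contextvars import ContextVar
from typing import Optional

_parent: ContextVar[Optional[str]] = ContextVar("parent", default=None)
recorded: list[tuple[str, Optional[str]]] = []  # (span name, parent span name)


def toy_observe(name: str):
    """Hypothetical decorator: record a span and make it the parent of nested spans."""
    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            recorded.append((name, _parent.get()))
            token = _parent.set(name)
            try:
                return await fn(*args, **kwargs)
            finally:
                _parent.reset(token)
        return wrapper
    return decorator


@toy_observe("agent_run")
async def fake_agent_run(prompt: str) -> str:
    """Stand-in for the auto-instrumented agent call."""
    return f"  reply to {prompt}  "


@toy_observe("respond_to_user")
async def respond_to_user(prompt: str) -> str:
    result = await fake_agent_run(prompt)
    return result.strip().upper()


reply = asyncio.run(respond_to_user("hi"))
```

The `recorded` list shows the outer function as the root and the simulated agent run as its child, which is the single-trace shape described above.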

Multiple agent runs under one trace

When a single logical unit of work makes several agent calls (e.g. a planner agent followed by a worker agent), wrap them in a with trace(...) block so they share a trace_id and show up as siblings under one root.

pydantic_ai_agent.py
from deepeval.tracing import trace
...

async def run_pipeline(prompt: str):
    with trace(name="planner_then_worker"):
        plan = await planner.run(prompt)
        return await worker.run(plan.output)

Mix native @observe spans with Pydantic AI spans

@observe works on any function, not just top-level ones. Decorating an internal helper inside a tool body adds a native deepeval span to the trace, which is useful for evaluating retrieval steps, ranker calls, or other sub-tool logic that Pydantic AI doesn't see.

pydantic_ai_agent.py
from deepeval.tracing import observe
...

@observe(name="rerank")
def rerank(docs: list[str], query: str) -> list[str]:
    return sorted(docs, key=lambda d: -score(d, query))


@agent.tool_plain
def retrieve(query: str) -> list[str]:
    raw = vector_store.search(query)
    return rerank(raw, query)

API reference

DeepEvalInstrumentationSettings(...) accepts the following trace-level kwargs. Each one is a default; runtime calls always win.

| Kwarg | Type | Description |
| --- | --- | --- |
| name | str | Default trace name. Override at runtime via update_current_trace. |
| thread_id | str | Default thread identifier. Useful for grouping conversational turns. |
| user_id | str | Default actor identifier. Override per-request via update_current_trace. |
| metadata | dict | Default trace metadata. Merged with runtime overrides; runtime wins. |
| tags | list[str] | Default tags applied to every trace produced by this agent. |
| environment | str | One of "development", "staging", "production", "testing". |
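
The "defaults, with runtime winning" rule can be illustrated with ordinary dictionaries. `merge_trace_fields` is a hypothetical helper, and merging metadata one level deep is an assumption made for this sketch rather than documented deepeval behavior.

```python
def merge_trace_fields(defaults: dict, runtime: dict) -> dict:
    """Combine default and runtime trace fields; runtime values win on conflict."""
    merged = {**defaults, **runtime}
    # Metadata is merged key by key rather than replaced wholesale.
    merged["metadata"] = {
        **defaults.get("metadata", {}),
        **runtime.get("metadata", {}),
    }
    return merged


defaults = {
    "name": "my-agent",
    "environment": "testing",
    "metadata": {"team": "search", "plan": "free"},
}
runtime = {"user_id": "u-123", "metadata": {"plan": "pro"}}

merged = merge_trace_fields(defaults, runtime)
```

Here the runtime `plan` replaces the default, the default `team` survives, and fields set on only one side pass through unchanged.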

For runtime helpers (update_current_trace, update_current_span, next_agent_span, next_llm_span) and the test-decorator surface (@observe, @assert_test, with trace(...)), see the tracing reference.
