Pydantic AI
Pydantic AI is a Python framework for building production-grade applications with Generative AI, with type safety and validation for agent outputs and LLM interactions.
The deepeval integration auto-instruments your Pydantic AI Agents and traces every call to them. Every agent run, tool call, and LLM call becomes a span you can inspect, without wiring trace structure by hand.
deepeval's Pydantic AI integration enables you to:
- Auto-instrument every Agent: each agent.run(...) produces a trace, and each LLM, tool, and sub-agent call inside it becomes a component span.
- Evaluate the trace end-to-end or target model / agent components with any deepeval metric.
- Run evals from a script (evals_iterator) or from CI/CD (pytest + deepeval test run): same metrics, two surfaces.
- Customize trace and span data at runtime from anywhere in the call stack: your tool bodies, post-processors, or the call site.
Getting Started
Installation
pip install -U deepeval pydantic-ai opentelemetry-sdk opentelemetry-exporter-otlp-proto-http
Under the hood, the integration plugs Pydantic AI's OpenTelemetry instrumentation into deepeval's span processor.
Instrument and evaluate
Pass DeepEvalInstrumentationSettings to the Agent's instrument keyword. From that point on, any agent.run(...), agent.run_sync(...), or agent.run_stream(...) call produces a trace deepeval can read.
from pydantic_ai import Agent
from deepeval.integrations.pydantic_ai import DeepEvalInstrumentationSettings
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

agent = Agent(
    "openai:gpt-5",
    system_prompt="Be concise, reply with one sentence.",
    instrument=DeepEvalInstrumentationSettings(),
)

# Goldens are the inputs you want to evaluate.
dataset = EvaluationDataset(goldens=[Golden(input="What's the weather in Paris?")])

# `evals_iterator` loops through the goldens and applies the metrics.
for golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):
    agent.run_sync(golden.input)  # Produces a trace for evaluation

Done. You've run your first eval with full traceability into Pydantic AI via deepeval.
What gets traced
Each agent.run(...) call produces a trace: the end-to-end unit your user observes, from the prompt going in to the final output coming out. Inside that trace are component spans for every step the agent took to produce the answer:
- LLM spans: one per LLM call inside the run.
- Tool spans: one per tool call.
- Agent spans: nested for sub-agent calls (delegations, handoffs).
Sync, async, and streaming paths all flow through the same instrumentation, so there's nothing to configure differently between them.
Trace                        ← what the user observes (end-to-end)
└── Agent: assistant         ← one agent.run(...) call
    ├── LLM: openai:gpt-5    ← component span: model decides which tool to call
    ├── Tool: get_weather    ← component span: tool input + output
    └── LLM: openai:gpt-5    ← component span: model produces the final answer

The trace and its component spans are independently evaluable. The next two sections describe how to run those evaluations.
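To make the trace shape concrete, here is a minimal sketch that models the tree above as plain Python data and picks out the component spans. The `Span` class and `walk` helper are hypothetical names used only for illustration; they are not deepeval or Pydantic AI classes, and real spans carry far more data.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical stand-in for a component span (not a deepeval class)."""
    kind: str  # "agent", "llm", or "tool"
    name: str
    children: list["Span"] = field(default_factory=list)


def walk(span: Span, depth: int = 0):
    """Yield (depth, span) pairs in depth-first order."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


# Mirrors the tree above: one agent span with two LLM calls around a tool call.
root = Span("agent", "assistant", [
    Span("llm", "openai:gpt-5"),
    Span("tool", "get_weather"),
    Span("llm", "openai:gpt-5"),
])

# Print the tree with one level of indentation per nesting depth.
for depth, span in walk(root):
    print("  " * depth + f"{span.kind}: {span.name}")

# Component-level evaluation targets a subset of spans, e.g. only the LLM calls.
llm_spans = [s for _, s in walk(root) if s.kind == "llm"]
```

Trace-level metrics look at the root's input and output, while component-level metrics look at individual entries such as those in `llm_spans`.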
Running evals
There are two surfaces for running evals against a Pydantic AI agent. Pick by where you want results to surface: your terminal during a notebook session, or your CI pipeline as a pass/fail gate. Metric definitions are the same in both.
In CI/CD (pytest)
Use the deepeval pytest integration. Each parametrized test invocation becomes one agent run; failing metrics fail the test, which fails the build. This is the right surface for regression gates and pre-merge checks.
Define an EvaluationDataset at module scope, parametrize the test over its goldens, call the agent inside the test, and let assert_test evaluate the trace it just produced.
import pytest
from pydantic_ai import Agent
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.integrations.pydantic_ai import DeepEvalInstrumentationSettings
from deepeval.metrics import AnswerRelevancyMetric
agent = Agent(
    "openai:gpt-5",
    system_prompt="Be concise, reply with one sentence.",
    instrument=DeepEvalInstrumentationSettings(name="my-agent"),
)

dataset = EvaluationDataset(
    goldens=[
        Golden(input="What's the weather in Paris?"),
        Golden(input="What's the weather in London?"),
    ]
)

@pytest.mark.parametrize("golden", dataset.goldens)
def test_agent(golden: Golden):
    agent.run_sync(golden.input)
    assert_test(golden=golden, metrics=[AnswerRelevancyMetric()])

Run it with:
deepeval test run test_pydantic_ai_agent.py
The same metrics you used in evals_iterator work unchanged here. The only difference is what surfaces the failures: a CI badge instead of a notebook cell.
In a script
Use EvaluationDataset + evals_iterator(...). Each Golden becomes one agent run; metrics score the resulting trace. This is the right surface for ad-hoc runs, notebooks, and one-off comparisons.
import asyncio
from pydantic_ai import Agent
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.evaluate.configs import AsyncConfig
from deepeval.integrations.pydantic_ai import DeepEvalInstrumentationSettings
from deepeval.metrics import AnswerRelevancyMetric
agent = Agent(
    "openai:gpt-5",
    system_prompt="Be concise, reply with one sentence.",
    instrument=DeepEvalInstrumentationSettings(name="my-agent"),
)

dataset = EvaluationDataset(
    goldens=[
        Golden(input="What's the weather in Paris?"),
        Golden(input="What's the weather in London?"),
    ]
)

answer_relevancy = AnswerRelevancyMetric()

for golden in dataset.evals_iterator(
    async_config=AsyncConfig(run_async=True),
    metrics=[answer_relevancy],
):
    task = asyncio.create_task(agent.run(golden.input))
    dataset.evaluate(task)

evals_iterator is async-friendly; wrap each invocation in asyncio.create_task and pass it to dataset.evaluate(...) so multiple goldens run concurrently against the same dataset.
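The concurrency claim above rests on how asyncio.create_task behaves: a task is scheduled as soon as it is created, so several runs can be in flight at once. The sketch below shows that with the standard library only. `fake_agent_run` is a hypothetical stand-in for agent.run(...) that sleeps instead of calling a model, and the event log exists only to make the interleaving visible; it says nothing about how deepeval schedules work internally.

```python
import asyncio

events: list[str] = []


async def fake_agent_run(prompt: str) -> str:
    """Hypothetical stand-in for agent.run(...); sleeps instead of calling a model."""
    events.append(f"start {prompt}")
    await asyncio.sleep(0.01)  # simulated model latency
    events.append(f"end {prompt}")
    return f"answer to {prompt}"


async def main() -> list[str]:
    prompts = ["Paris", "London"]
    # create_task schedules each coroutine right away, so the runs overlap.
    tasks = [asyncio.create_task(fake_agent_run(p)) for p in prompts]
    # gather returns results in the order the tasks were passed in.
    return await asyncio.gather(*tasks)


answers = asyncio.run(main())
print(events)
```

Both "start" events are logged before either "end" event, which is the overlap that awaiting each run in sequence would not give you.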
Applying metrics to components
The metrics=[...] argument you passed to evals_iterator in the previous section evaluates the trace, that is, the end-to-end behavior the user observes. To evaluate a component instead (a specific LLM call or the agent span itself), stage the metric with the appropriate next_*_span(...) wrapper before the run.
LLM calls
Wrap the run in next_llm_span(metrics=[...]) to attach metrics to the next LLM span. This is useful when you want to evaluate the LLM's reasoning step in isolation from the tool's effect.
from deepeval.tracing import next_llm_span
async def run_agent(prompt: str):
    with next_llm_span(metrics=[answer_relevancy]):
        return await agent.run(prompt)

for golden in dataset.evals_iterator(async_config=AsyncConfig(run_async=True)):
    task = asyncio.create_task(run_agent(golden.input))
    dataset.evaluate(task)

Agent spans
next_agent_span(metrics=[...]) targets the agent component itself. The agent span shares its input and output with the trace, but it's a distinct unit. Use this when you want a metric on the agent span specifically (rather than the trace).
from deepeval.tracing import next_agent_span
async def run_agent(prompt: str):
    with next_agent_span(metrics=[answer_relevancy]):
        return await agent.run(prompt)

for golden in dataset.evals_iterator(async_config=AsyncConfig(run_async=True)):
    task = asyncio.create_task(run_agent(golden.input))
    dataset.evaluate(task)

For deterministic tool calls, prefer update_current_span(...) to add metadata, inputs, and outputs instead of attaching metrics to the tool span.
Customizing trace and span data at runtime
Trace-level fields you set on DeepEvalInstrumentationSettings are defaults; they apply to every trace produced by that agent. For anything dynamic, the right API depends on where your code runs.
Pydantic AI creates most of the trace structure for you, which means the agent, LLM, and tool spans are mostly hidden behind agent.run(...). Calls like update_current_trace(...) and update_current_span(...) only work while there is an active deepeval trace/span in context. In practice, that means a Pydantic AI tool body is your clearest mutation point, because Pydantic has already opened the trace and the tool span before your function runs.
If you need to customize from outside a tool, use DeepEvalInstrumentationSettings for static defaults, next_*_span(...) to stage config for the next Pydantic-created span, or @observe / with trace(...) when you own the outer operation. The advanced section below shows those scenarios.
Trace-level fields from inside a tool
update_current_trace(...) mutates the active trace. Use it when a tool discovers metadata you only know during the run, like a user id, request id, retrieved document id, or routing decision.
from deepeval.tracing import update_current_trace
...
@agent.tool_plain
def fetch_user(user_id: str) -> dict:
    user = users_db.get(user_id)
    update_current_trace(
        user_id=user_id,
        metadata={"plan": user["plan"], "region": user["region"]},
    )
    return user

Span-level fields from inside a tool
update_current_span(...) writes to whichever span Pydantic AI just opened, which is typically the tool span if you call it from inside a tool body. It is useful for tagging tool-call metadata (cache hits, downstream IDs, retrieval context) without restructuring the tool.
from deepeval.tracing import update_current_span
...
@agent.tool_plain
def get_weather(city: str) -> str:
    cache_hit, value = weather_cache.lookup(city)
    update_current_span(
        metadata={"cache_hit": cache_hit, "city": city},
        output=value,
    )
    return value

The general rule: settings hold defaults, next_*_span(...) stages changes before Pydantic opens the span, and update_current_*(...) mutates only after your code is already inside an active trace/span.
Advanced patterns
The primitives above (DeepEvalInstrumentationSettings, @observe, with trace(...), next_*_span(...), update_current_*(...)) compose around one boundary: Pydantic AI owns the auto-instrumented spans, and your code customizes them from the places it can actually see. Use @observe or with trace(...) when you own an outer workflow, next_*_span(...) when you want to configure a Pydantic-created span before it exists, and update_current_*(...) when a tool or observed function is already running inside the trace.
Evaluate subagents with next_*_span
next_*_span(metrics=[...]) stages a metric for the next matching Pydantic AI component span. Use this when you want to evaluate a subagent or model step instead of the full trace. Pick the helper that matches the span you want to score: next_agent_span(...) or next_llm_span(...).
from deepeval.tracing import next_agent_span
...
async def run_agent(prompt: str):
    with next_agent_span(metrics=[answer_relevancy]):
        return await agent.run(prompt)

No trace-level metrics required
Trace-level metrics are end-to-end metrics: they score the whole trace. They are not strictly necessary here because the AnswerRelevancyMetric is attached to the next agent span, so CI/CD and scripts only need to run the subagent.
This is how you'd run it:
import asyncio
import pytest
from deepeval import assert_test
...
@pytest.mark.parametrize("golden", dataset.goldens)
def test_agent_span(golden: Golden):
    asyncio.run(run_agent(golden.input))
    assert_test(golden=golden)

deepeval test run test_pydantic_ai_agent.py

...
for golden in dataset.evals_iterator(async_config=AsyncConfig(run_async=True)):
    task = asyncio.create_task(run_agent(golden.input))
    dataset.evaluate(task)

Wrap an agent run in @observe
When the agent run isn't your top-level unit of work (for example, a respond_to_user(...) function that calls the agent and post-processes the result), you can decorate that outer function with @observe. The Pydantic AI spans nest under your @observe span automatically; the result is a single trace rooted at your function with the agent run inside it.
from deepeval.tracing import observe
...
@observe(name="respond_to_user")
async def respond_to_user(prompt: str) -> str:
    result = await agent.run(prompt)
    return result.output.strip().upper()

Multiple agent runs under one trace
When a single logical unit of work makes several agent calls (e.g. a planner agent followed by a worker agent), bracket them with with trace(...) so they share a trace_id and show up as siblings under one root.
from deepeval.tracing import trace
...
async def run_pipeline(prompt: str):
    with trace(name="planner_then_worker"):
        plan = await planner.run(prompt)
        return await worker.run(plan.output)

Mix native @observe spans with Pydantic AI spans
@observe works on any function, not just top-level ones. Decorating an internal helper inside a tool body adds a native deepeval span to the trace, which is useful for evaluating retrieval steps, ranker calls, or other sub-tool logic that Pydantic AI doesn't see.
from deepeval.tracing import observe
...
@observe(name="rerank")
def rerank(docs: list[str], query: str) -> list[str]:
    return sorted(docs, key=lambda d: -score(d, query))

@agent.tool_plain
def retrieve(query: str) -> list[str]:
    raw = vector_store.search(query)
    return rerank(raw, query)

API reference
DeepEvalInstrumentationSettings(...) accepts the following trace-level kwargs. Each one is a default; runtime calls always win.
| Kwarg | Type | Description |
|---|---|---|
| name | str | Default trace name. Override at runtime via update_current_trace. |
| thread_id | str | Default thread identifier. Useful for grouping conversational turns. |
| user_id | str | Default actor identifier. Override per-request via update_current_trace. |
| metadata | dict | Default trace metadata. Merged with runtime overrides; runtime wins. |
| tags | list[str] | Default tags applied to every trace produced by this agent. |
| environment | str | One of "development", "staging", "production", "testing". |
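As a rough illustration of "merged with runtime overrides; runtime wins" for the metadata kwarg, the sketch below uses a plain dict merge. The keys and values are hypothetical, and a shallow merge is an assumption: this page does not state whether deepeval merges nested dictionaries.

```python
# Hypothetical settings-level defaults, as would be passed via metadata=...
defaults = {"plan": "free", "region": "eu"}

# Hypothetical per-request values, as would be set at runtime inside a tool.
runtime = {"plan": "pro", "request_id": "r-1"}

# In a dict merge, later keys win, so runtime values override the defaults
# while untouched defaults are kept.
merged = {**defaults, **runtime}
print(merged)
```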
For runtime helpers (update_current_trace, update_current_span, next_agent_span, next_llm_span) and the tracing and test helpers (@observe, assert_test, with trace(...)), see the tracing reference.