
Pydantic AI

Pydantic AI is a Python framework for building reliable, production-grade applications with Generative AI, providing type safety and validation for agent outputs and LLM interactions.

tip

We recommend logging in to Confident AI to view your Pydantic AI evaluations.

deepeval login

End-to-End Evals

deepeval allows you to evaluate Pydantic AI applications end-to-end in under a minute.

Configure Pydantic AI

Create an agent and instrument Pydantic AI with deepeval. Metrics are passed to deepeval's Agent wrapper in the next step.

main.py
import time

from pydantic_ai import Agent

from deepeval.integrations.pydantic_ai import instrument_pydantic_ai

instrument_pydantic_ai(api_key="<your-confident-api-key>")

agent = Agent(
    "openai:gpt-4o-mini",
    system_prompt="Be concise, reply with one sentence.",
)

result = agent.run_sync("What are the LLMs?")
print(result)
time.sleep(10)  # wait for the trace to be posted

# Running the agent in async mode:

# import asyncio
#
# async def main():
#     result = await agent.run("What are the LLMs?")
#     print(result)
#
# if __name__ == "__main__":
#     asyncio.run(main())
#     time.sleep(10)  # wait for the trace to be posted
info

Evaluations are supported for the Pydantic AI Agent. Only metrics whose required parameters are input and output are eligible for evaluation.
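
For example, AnswerRelevancyMetric qualifies because it only scores a test case's input against its output. A minimal sketch (the threshold value below is an illustrative assumption, not a recommendation):

from deepeval.metrics import AnswerRelevancyMetric

# AnswerRelevancyMetric only needs the input and the agent's output,
# so it satisfies the input/output requirement above.
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)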

Run evaluations

Create an EvaluationDataset and invoke your Pydantic AI application for each golden within the evals_iterator() loop to run end-to-end evaluations.

main.py
import asyncio

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.integrations.pydantic_ai import instrument_pydantic_ai, Agent
from deepeval.dataset import EvaluationDataset, Golden

instrument_pydantic_ai(api_key="<your-confident-api-key>")

agent = Agent(
    "openai:gpt-4o-mini",
    system_prompt="Be concise, reply with one sentence.",
)
answer_relevancy_metric = AnswerRelevancyMetric()

dataset = EvaluationDataset(
    goldens=[
        Golden(input="What's 7 * 8?"),
        Golden(input="What's 7 * 6?"),
    ]
)

for golden in dataset.evals_iterator():
    task = asyncio.create_task(
        agent.run(
            golden.input,
            metrics=[answer_relevancy_metric],
        )
    )
    dataset.evaluate(task)

✅ Done. The evals_iterator will automatically generate a test run with individual evaluation traces for each golden.

View on Confident AI (optional)

note

If you need to evaluate individual components of your Pydantic AI application, set up tracing instead.
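
As a rough sketch of what component-level tracing looks like, deepeval's @observe decorator can wrap individual functions so that each call shows up as its own span on the trace. This assumes the observe decorator from deepeval.tracing; the retrieval function below is hypothetical:

from deepeval.tracing import observe

@observe()
def retrieve_documents(query: str) -> list[str]:
    # Hypothetical component; each call appears as a separate span,
    # which component-level metrics can then be applied to.
    return ["Pydantic AI is a Python agent framework."]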

Evals in Production

To run online evaluations in production, replace metrics with a metric collection string from Confident AI, and push your Pydantic AI agent to production.

import time

from deepeval.integrations.pydantic_ai import instrument_pydantic_ai, Agent

instrument_pydantic_ai(api_key="<your-confident-api-key>")

agent = Agent(
    "openai:gpt-4o-mini",
    system_prompt="Be concise, reply with one sentence.",
)

result = agent.run_sync(
    "What are the LLMs?",
    metric_collection="test_collection_1",
)

print(result)
time.sleep(10)  # wait for the trace to be posted
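
If your production agent runs asynchronously, the same pattern applies with agent.run. This sketch assumes the wrapper's run method accepts the same metric_collection keyword as run_sync:

import asyncio
import time

from deepeval.integrations.pydantic_ai import instrument_pydantic_ai, Agent

instrument_pydantic_ai(api_key="<your-confident-api-key>")

agent = Agent(
    "openai:gpt-4o-mini",
    system_prompt="Be concise, reply with one sentence.",
)

async def main():
    # Assumption: metric_collection is accepted by run() as well as run_sync().
    result = await agent.run(
        "What are the LLMs?",
        metric_collection="test_collection_1",
    )
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
    time.sleep(10)  # wait for the trace to be posted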