OpenAI
OpenAI provides chat completions and responses APIs for building LLM applications.
The deepeval integration is a drop-in replacement for OpenAI's client. Every client.chat.completions.create(...) and client.responses.create(...) call becomes an LLM span you can evaluate, without rewriting how you call the API.
deepeval's OpenAI integration enables you to:
- Drop in deepeval.openai.OpenAI: every chat completion or response produces an LLM span with input, output, and tools_called captured automatically.
- Evaluate LLM calls with any deepeval metric through LlmSpanContext.
- Run evals from scripts or CI/CD: same client, different surfaces.
- Compose with @observe and with trace(...) to evaluate larger flows that wrap one or more OpenAI calls.
Getting Started
Installation
pip install -U deepeval openai

deepeval.openai.OpenAI and deepeval.openai.AsyncOpenAI import OpenAI's classes and patch them in place. Existing kwargs, async paths, streaming, and tool-calling behavior all work unchanged.
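To make "patch in place" concrete, here is a minimal, library-free sketch of the general technique: wrap a method so each call records a span while arguments and the return value pass through untouched. FakeCompletions, patch_create, and SPANS are hypothetical names invented for illustration. This is not deepeval's actual implementation, only one way such a wrapper could be built.

```python
import functools

SPANS = []  # hypothetical in-memory span store, for illustration only


class FakeCompletions:
    """Stand-in for an SDK resource; not the real OpenAI client."""

    def create(self, *, model, messages, **kwargs):
        return {"model": model, "content": f"echo: {messages[-1]['content']}"}


def patch_create(cls):
    """Wrap cls.create so each call records a span, then return cls."""
    original = cls.create

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        result = original(self, *args, **kwargs)  # call through unchanged
        SPANS.append(
            {
                "type": "llm",
                "model": kwargs.get("model"),
                "input": kwargs.get("messages"),
                "output": result["content"],
            }
        )
        return result  # the caller sees exactly what the original returned

    cls.create = wrapper
    return cls


patch_create(FakeCompletions)
response = FakeCompletions().create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "hi"}],
)
```

Because the wrapper forwards every argument and returns the original result, calling code does not need to change. That is the property the drop-in client relies on.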
Instrument and evaluate
Replace from openai import OpenAI with from deepeval.openai import OpenAI. Wrap each call you want to evaluate in with trace(llm_span_context=LlmSpanContext(metrics=[...])).
from deepeval.openai import OpenAI
from deepeval.tracing import trace, LlmSpanContext
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric
client = OpenAI()
# Goldens are the inputs you want to evaluate.
dataset = EvaluationDataset(goldens=[Golden(input="What's the capital of France?")])
for golden in dataset.evals_iterator():
    with trace(llm_span_context=LlmSpanContext(metrics=[AnswerRelevancyMetric()])):
        client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Be concise."},
                {"role": "user", "content": golden.input},
            ],
        )

Done. You've run your first eval against an OpenAI call, with full traceability via deepeval.
What gets traced
Each patched OpenAI call produces one LLM span under the active trace. When the call uses tool-calling, the span's tools_called field captures every tool invocation the model returned, with no extra wiring needed.
- LLM spans: one per chat.completions.create(...), chat.completions.parse(...), or responses.create(...) call. Captures input messages, output text, token counts, and tools_called.
- Trace: auto-created when the call has no parent. If the call runs inside with trace(...) or @observe, the LLM span nests under that trace instead.
Trace  ← auto-created or user-owned
└── LLM: gpt-4o  ← one client.chat.completions.create(...) call

The trace and its LLM span are independently evaluable.
Running evals
There are two surfaces for running evals against OpenAI calls. Pick by where you want results to surface: your terminal during development, or your CI pipeline as a pass/fail gate.
In CI/CD (pytest)
Use the deepeval pytest integration. Each parametrized test invocation becomes one OpenAI call; failing metrics fail the test, which fails the build.
import pytest
from deepeval import assert_test
from deepeval.openai import OpenAI
from deepeval.tracing import trace, LlmSpanContext
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric
client = OpenAI()
dataset = EvaluationDataset(goldens=[
Golden(input="What's the capital of France?"),
Golden(input="Who wrote Hamlet?"),
])
@pytest.mark.parametrize("golden", dataset.goldens)
def test_openai_app(golden: Golden):
    with trace(llm_span_context=LlmSpanContext(metrics=[AnswerRelevancyMetric()])):
        client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Be concise."},
                {"role": "user", "content": golden.input},
            ],
        )
    assert_test(golden=golden)

Run it with:

deepeval test run test_openai_app.py

In a script
Use EvaluationDataset + evals_iterator(...). Each Golden becomes one OpenAI call; metrics score the resulting LLM span.
import asyncio
from deepeval.openai import AsyncOpenAI
from deepeval.tracing import trace, LlmSpanContext
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.evaluate.configs import AsyncConfig
from deepeval.metrics import AnswerRelevancyMetric
client = AsyncOpenAI()
dataset = EvaluationDataset(goldens=[
Golden(input="What's the capital of France?"),
Golden(input="Who wrote Hamlet?"),
])
async def call_openai(prompt: str):
    with trace(llm_span_context=LlmSpanContext(metrics=[AnswerRelevancyMetric()])):
        return await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )

for golden in dataset.evals_iterator(async_config=AsyncConfig(run_async=True)):
    task = asyncio.create_task(call_openai(golden.input))
    dataset.evaluate(task)

Sync (OpenAI) and async (AsyncOpenAI) clients both work; pick whichever matches your code.
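The async script above creates one task per golden instead of awaiting each call in turn. As a rough illustration of why that matters, the following plain-asyncio sketch schedules several calls as tasks so they overlap. fake_llm_call and run_all are hypothetical stand-ins, and how dataset.evaluate(task) collects tasks internally is not described here.

```python
import asyncio


async def fake_llm_call(prompt: str) -> str:
    """Hypothetical stand-in for an awaited client call."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"answer to: {prompt}"


async def run_all(prompts):
    # create_task schedules each coroutine on the running loop, so calls overlap.
    tasks = [asyncio.create_task(fake_llm_call(p)) for p in prompts]
    # gather returns results in the same order the tasks were passed in.
    return await asyncio.gather(*tasks)


results = asyncio.run(
    run_all(["What's the capital of France?", "Who wrote Hamlet?"])
)
```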
Applying metrics to LLM spans
Passing metrics=[...] to LlmSpanContext evaluates the next OpenAI call's LLM span specifically. The same context manager lets you attach extra evaluation parameters that some metrics need.
from deepeval.openai import OpenAI
from deepeval.tracing import trace, LlmSpanContext
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
client = OpenAI()
with trace(
    llm_span_context=LlmSpanContext(
        metrics=[AnswerRelevancyMetric(), FaithfulnessMetric()],
        retrieval_context=["Paris is the capital of France."],
    ),
):
    client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What's the capital of France?"}],
    )

LlmSpanContext accepts metrics, expected_output, expected_tools, context, retrieval_context, and prompt. Each one is read by the OpenAI patch when the next LLM span is created.
Customizing trace and span data
The patch captures input messages, output text, and tools_called automatically. For anything else, the right API depends on where your code runs.
- Use with trace(...) for trace-level fields (name, tags, metadata, thread_id, user_id).
- Use LlmSpanContext for LLM-span-level fields the metric needs (expected_output, retrieval_context, etc.).
- Use @observe to wrap retrieval, post-processing, or any other step you want to see as its own span in the trace.
from deepeval.openai import OpenAI
from deepeval.tracing import trace, LlmSpanContext, observe
client = OpenAI()
@observe(type="retriever")
def retrieve_docs(query: str) -> list[str]:
    return ["Paris is the capital of France."]

@observe()
def respond_to_user(prompt: str) -> str:
    docs = retrieve_docs(prompt)
    with trace(
        llm_span_context=LlmSpanContext(retrieval_context=docs),
        user_id="user-123",
        tags=["openai", "rag"],
    ):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "\n".join(docs)},
                {"role": "user", "content": prompt},
            ],
        )
    return response.choices[0].message.content

Advanced patterns
The primitives above (deepeval.openai.OpenAI, LlmSpanContext, @observe, with trace(...)) compose around one boundary: the patch owns each LLM call's span, and your code chooses what trace to put it inside.
Wrap an OpenAI call in @observe
When the OpenAI call is part of a larger operation, decorate the outer function with @observe. The LLM span nests under your observed span automatically.
from deepeval.tracing import observe, trace, LlmSpanContext
from deepeval.metrics import AnswerRelevancyMetric
...
@observe(name="respond_to_user")
def respond_to_user(prompt: str) -> str:
    with trace(llm_span_context=LlmSpanContext(metrics=[AnswerRelevancyMetric()])):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
    return response.choices[0].message.content

No trace-level metrics required
Trace-level metrics are end-to-end metrics: they score the whole trace. They are not strictly necessary here because AnswerRelevancyMetric is attached to the LLM span, so CI/CD and scripts only need to call the function.
This is how you'd run it:
import pytest
from deepeval import assert_test
...
@pytest.mark.parametrize("golden", dataset.goldens)
def test_respond_to_user(golden: Golden):
    respond_to_user(golden.input)
    assert_test(golden=golden)

deepeval test run test_openai_app.py

...
for golden in dataset.evals_iterator():
    respond_to_user(golden.input)

Multiple OpenAI calls under one trace
When a single logical unit of work makes several OpenAI calls (e.g. a planner call followed by a respond call), bracket them with with trace(...) so the LLM spans share a trace_id and show up as siblings under one root.
from deepeval.tracing import trace
...
def plan_then_respond(prompt: str):
    with trace(name="plan_then_respond"):
        plan = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Plan: {prompt}"}],
        )
        return client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": plan.choices[0].message.content}],
        )

Tool-calling models
When the model returns tool calls, the LLM span's tools_called field captures them automatically. Use expected_tools on LlmSpanContext if you want to evaluate tool selection with a tool-aware metric.
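To show what evaluating tool selection can mean, here is a small standalone sketch that compares expected tool calls against the ones actually made. Tool and tool_match_score are hypothetical names, and the scoring rule (the fraction of expected tools found) is an assumption chosen for illustration. deepeval's tool-aware metrics may score differently.

```python
from dataclasses import dataclass, field


@dataclass
class Tool:
    """Hypothetical stand-in for a tool call: a name plus its arguments."""
    name: str
    input_parameters: dict = field(default_factory=dict)


def tool_match_score(expected, called, check_parameters=False):
    """Return the fraction of expected tools that appear among called tools."""
    if not expected:
        return 1.0  # nothing was expected, so nothing can be missing

    def matches(e, c):
        if e.name != c.name:
            return False
        # Optionally require the arguments to match as well as the name.
        return (not check_parameters) or e.input_parameters == c.input_parameters

    hits = sum(1 for e in expected if any(matches(e, c) for c in called))
    return hits / len(expected)


expected = [Tool("get_weather", {"city": "Paris"})]
called = [Tool("get_weather", {"city": "Lyon"})]

name_only = tool_match_score(expected, called)
strict = tool_match_score(expected, called, check_parameters=True)
```

Here the model picked the right tool with the wrong argument, so the name-only score is full while the strict score is zero. Which of the two you care about decides how strict a tool metric needs to be.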
from deepeval.test_case import ToolCall
from deepeval.tracing import trace, LlmSpanContext
...
with trace(
    llm_span_context=LlmSpanContext(
        expected_tools=[ToolCall(name="get_weather", input_parameters={"city": "Paris"})],
    ),
):
    client.chat.completions.create(model="gpt-4o", messages=[...], tools=[...])

API reference
LlmSpanContext(...) accepts the following kwargs. Each is read once when the next OpenAI call's LLM span is created.
| Kwarg | Type | Description |
|---|---|---|
| metrics | list | Metrics applied to the next LLM span. |
| prompt | Prompt | Confident AI prompt object; captured on the LLM span for prompt-version analytics. |
| expected_output | str | Reference output for metrics that compare against ground truth. |
| expected_tools | list | Reference tool calls for tool-aware metrics. |
| context | list[str] | Ideal context the model should use when answering. |
| retrieval_context | list[str] | Retrieved context the model actually used (Faithfulness, Contextual Relevancy, etc.). |

with trace(...) accepts trace-level kwargs (name, tags, metadata, thread_id, user_id, metrics, input, output); see the tracing reference.