
OpenAI

OpenAI provides chat completions and responses APIs for building LLM applications.

The deepeval integration is a drop-in replacement for OpenAI's client. Every client.chat.completions.create(...) and client.responses.create(...) call becomes an LLM span you can evaluate, without rewriting how you call the API.

deepeval's OpenAI integration enables you to:

  • Drop in deepeval.openai.OpenAI: every chat completion or response produces an LLM span with input, output, and tools_called captured automatically.
  • Evaluate LLM calls with any deepeval metric through LlmSpanContext.
  • Run evals from scripts or CI/CD: same client, different surfaces.
  • Compose with @observe and with trace(...) to evaluate larger flows that wrap one or more OpenAI calls.

Getting Started

Installation

pip install -U deepeval openai

deepeval.openai.OpenAI and deepeval.openai.AsyncOpenAI import OpenAI's classes and patch them in place. Existing kwargs, async paths, streaming, and tool-calling behavior all work unchanged.
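
To make "patch them in place" concrete, here is a minimal sketch of the general technique, not deepeval's actual implementation. FakeCompletions, patch_create, and recorded_spans are hypothetical names invented for this illustration. The point is only that a wrapper can record inputs and outputs while the call signature and return value stay the same.

```python
import functools
import time


class FakeCompletions:
    """Hypothetical stand-in for an SDK resource class (not OpenAI's)."""

    def create(self, *, model, messages, **kwargs):
        return {"model": model, "content": f"echo: {messages[-1]['content']}"}


recorded_spans = []  # where this sketch stores captured spans


def patch_create(cls):
    """Replace cls.create with a wrapper that records one span per call."""
    original = cls.create

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        result = original(self, *args, **kwargs)  # same arguments, same result
        recorded_spans.append({
            "model": kwargs.get("model"),
            "input": kwargs.get("messages"),
            "output": result["content"],
            "duration_s": time.perf_counter() - start,
        })
        return result

    cls.create = wrapper  # patched in place: existing call sites are untouched


patch_create(FakeCompletions)

response = FakeCompletions().create(
    model="demo-model",
    messages=[{"role": "user", "content": "hi"}],
)
```

The caller still gets the original return value; the span is a side effect recorded alongside it.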

Instrument and evaluate

Replace from openai import OpenAI with from deepeval.openai import OpenAI, then wrap each call you want to evaluate in a with trace(llm_span_context=LlmSpanContext(metrics=[...])) block.

openai_app.py
from deepeval.openai import OpenAI
from deepeval.tracing import trace, LlmSpanContext
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric

client = OpenAI()

# Goldens are the inputs you want to evaluate.
dataset = EvaluationDataset(goldens=[Golden(input="What's the capital of France?")])

for golden in dataset.evals_iterator():
    with trace(llm_span_context=LlmSpanContext(metrics=[AnswerRelevancyMetric()])):
        client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Be concise."},
                {"role": "user", "content": golden.input},
            ],
        )

Done ✅. You've run your first eval against an OpenAI call with full traceability via deepeval.

What gets traced

Each patched OpenAI call produces one LLM span under the active trace. When the call uses tool-calling, the span's tools_called field captures every tool invocation the model returned, with no extra wiring needed.

  • LLM spans: one per chat.completions.create(...), chat.completions.parse(...), or responses.create(...) call. Each captures input messages, output text, token counts, and tools_called.
  • Trace: auto-created when the call has no parent. If the call runs inside with trace(...) or @observe, the LLM span nests under that trace instead.

Trace                          ← auto-created or user-owned
└── LLM: gpt-4o                ← one client.chat.completions.create(...) call

The trace and its LLM span are independently evaluable.
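
The "auto-created or user-owned" rule can be pictured with a small sketch. This is an illustration of the idea using Python's contextvars, not deepeval's internals. demo_trace, record_llm_span, and finished_traces are hypothetical names.

```python
import contextvars
import uuid
from contextlib import contextmanager

_current_trace = contextvars.ContextVar("current_trace", default=None)
finished_traces = []  # completed traces, in the order they finished


@contextmanager
def demo_trace(name):
    """Open a user-owned trace; spans recorded inside attach to it."""
    trace = {"id": uuid.uuid4().hex, "name": name, "spans": []}
    token = _current_trace.set(trace)
    try:
        yield trace
    finally:
        _current_trace.reset(token)
        finished_traces.append(trace)


def record_llm_span(model):
    """Attach a span to the active trace, or auto-create a one-span trace."""
    span = {"type": "llm", "model": model}
    parent = _current_trace.get()
    if parent is None:
        finished_traces.append(
            {"id": uuid.uuid4().hex, "name": "auto", "spans": [span]}
        )
    else:
        parent["spans"].append(span)
    return span


record_llm_span("gpt-4o")  # no parent: the span gets its own trace

with demo_trace("plan_then_respond"):
    record_llm_span("gpt-4o")  # both spans land in the same trace
    record_llm_span("gpt-4o")
```

In this sketch the first call ends up in a trace named "auto", while the two calls inside the with block are siblings under one trace.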

Running evals

There are two surfaces for running evals against OpenAI calls. Pick by where you want results to surface: your terminal during development, or your CI pipeline as a pass/fail gate.

In CI/CD (pytest)

Use the deepeval pytest integration. Each parametrized test invocation becomes one OpenAI call; failing metrics fail the test, which fails the build.

test_openai_app.py
import pytest
from deepeval import assert_test
from deepeval.openai import OpenAI
from deepeval.tracing import trace, LlmSpanContext
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric

client = OpenAI()
dataset = EvaluationDataset(goldens=[
    Golden(input="What's the capital of France?"),
    Golden(input="Who wrote Hamlet?"),
])

@pytest.mark.parametrize("golden", dataset.goldens)
def test_openai_app(golden: Golden):
    with trace(llm_span_context=LlmSpanContext(metrics=[AnswerRelevancyMetric()])):
        client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Be concise."},
                {"role": "user", "content": golden.input},
            ],
        )
    assert_test(golden=golden)

Run it with:

deepeval test run test_openai_app.py

In a script

Use EvaluationDataset + evals_iterator(...). Each Golden becomes one OpenAI call; metrics score the resulting LLM span.

openai_app.py
import asyncio

from deepeval.openai import AsyncOpenAI
from deepeval.tracing import trace, LlmSpanContext
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.evaluate.configs import AsyncConfig
from deepeval.metrics import AnswerRelevancyMetric

client = AsyncOpenAI()
dataset = EvaluationDataset(goldens=[
    Golden(input="What's the capital of France?"),
    Golden(input="Who wrote Hamlet?"),
])

async def call_openai(prompt: str):
    with trace(llm_span_context=LlmSpanContext(metrics=[AnswerRelevancyMetric()])):
        return await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )

for golden in dataset.evals_iterator(async_config=AsyncConfig(run_async=True)):
    task = asyncio.create_task(call_openai(golden.input))
    dataset.evaluate(task)

Sync (OpenAI) and async (AsyncOpenAI) clients both work; pick whichever matches your code.

Applying metrics to LLM spans

Passing metrics=[...] to LlmSpanContext evaluates the next OpenAI call's LLM span specifically. The same context manager lets you attach extra evaluation parameters that some metrics need.

openai_app.py
from deepeval.openai import OpenAI
from deepeval.tracing import trace, LlmSpanContext
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

client = OpenAI()

with trace(
    llm_span_context=LlmSpanContext(
        metrics=[AnswerRelevancyMetric(), FaithfulnessMetric()],
        retrieval_context=["Paris is the capital of France."],
    ),
):
    client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What's the capital of France?"}],
    )

LlmSpanContext accepts metrics, expected_output, expected_tools, context, retrieval_context, and prompt. Each one is read by the OpenAI patch when the next LLM span is created.

Customizing trace and span data

The patch captures input messages, output text, and tools_called automatically. For anything else, the right API depends on which level the data belongs to.

  • Use with trace(...) for trace-level fields (name, tags, metadata, thread_id, user_id).
  • Use LlmSpanContext for LLM-span-level fields the metric needs (expected_output, retrieval_context, etc.).
  • Use @observe to wrap retrieval, post-processing, or any other step you want to see as its own span in the trace.

openai_app.py
from deepeval.openai import OpenAI
from deepeval.tracing import trace, LlmSpanContext, observe

client = OpenAI()

@observe(type="retriever")
def retrieve_docs(query: str) -> list[str]:
    return ["Paris is the capital of France."]

@observe()
def respond_to_user(prompt: str) -> str:
    docs = retrieve_docs(prompt)
    with trace(
        llm_span_context=LlmSpanContext(retrieval_context=docs),
        user_id="user-123",
        tags=["openai", "rag"],
    ):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "\n".join(docs)},
                {"role": "user", "content": prompt},
            ],
        )
    return response.choices[0].message.content

Advanced patterns

The primitives above (deepeval.openai.OpenAI, LlmSpanContext, @observe, with trace(...)) compose around one boundary: the patch owns each LLM call's span, and your code chooses which trace to put it in.

Wrap an OpenAI call in @observe

When the OpenAI call is part of a larger operation, decorate the outer function with @observe. The LLM span nests under your observed span automatically.

openai_app.py
from deepeval.tracing import observe, trace, LlmSpanContext
from deepeval.metrics import AnswerRelevancyMetric
...

@observe(name="respond_to_user")
def respond_to_user(prompt: str) -> str:
    with trace(llm_span_context=LlmSpanContext(metrics=[AnswerRelevancyMetric()])):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
    return response.choices[0].message.content

No trace-level metrics required

Trace-level metrics are end-to-end metrics: they score the whole trace. They are not strictly necessary here because AnswerRelevancyMetric is attached to the LLM span, so CI/CD and scripts only need to call the function.

This is how you'd run it:

test_openai_app.py
import pytest
from deepeval import assert_test
...

@pytest.mark.parametrize("golden", dataset.goldens)
def test_respond_to_user(golden: Golden):
    respond_to_user(golden.input)
    assert_test(golden=golden)

Run the test with:

deepeval test run test_openai_app.py

Or iterate over the goldens in a script:

openai_app.py
...

for golden in dataset.evals_iterator():
    respond_to_user(golden.input)

Multiple OpenAI calls under one trace

When a single logical unit of work makes several OpenAI calls (e.g. a planner call followed by a respond call), wrap them in a single with trace(...) block so the LLM spans share a trace_id and show up as siblings under one root.

openai_app.py
from deepeval.tracing import trace
...

def plan_then_respond(prompt: str):
    with trace(name="plan_then_respond"):
        plan = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Plan: {prompt}"}],
        )
        return client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": plan.choices[0].message.content}],
        )

Tool-calling models

When the model returns tool calls, the LLM span's tools_called field captures them automatically. Use expected_tools on LlmSpanContext if you want to evaluate tool selection with a tool-aware metric.

openai_app.py
from deepeval.test_case import ToolCall
from deepeval.tracing import trace, LlmSpanContext
...

with trace(
    llm_span_context=LlmSpanContext(
        expected_tools=[ToolCall(name="get_weather", input_parameters={"city": "Paris"})],
    ),
):
    client.chat.completions.create(model="gpt-4o", messages=[...], tools=[...])
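
For orientation, the sketch below shows what a tool definition in the Chat Completions "tools" format could look like, plus a deliberately simple comparison of expected and actual tool calls. The tools and messages values are illustrative, not the ones elided in the snippet above. tool_calls_match is a hypothetical helper, and deepeval's tool-aware metrics may score tool selection differently.

```python
import json

# Hypothetical tool definition in the Chat Completions "tools" format.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]
messages = [{"role": "user", "content": "What's the weather in Paris?"}]


def tool_calls_match(expected, called):
    """Return True when both lists hold the same (name, arguments) pairs."""
    def normalize(calls):
        # Serialize arguments with sorted keys so key order does not matter.
        return sorted(
            (call["name"], json.dumps(call["input_parameters"], sort_keys=True))
            for call in calls
        )
    return normalize(expected) == normalize(called)


expected = [{"name": "get_weather", "input_parameters": {"city": "Paris"}}]
matching = [{"name": "get_weather", "input_parameters": {"city": "Paris"}}]
wrong_city = [{"name": "get_weather", "input_parameters": {"city": "London"}}]
```

An exact match like this is the strictest possible check; a real metric might give partial credit for the right tool with the wrong arguments.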

API reference

LlmSpanContext(...) accepts the following kwargs. Each is read once when the next OpenAI call's LLM span is created.

| Kwarg | Type | Description |
| --- | --- | --- |
| metrics | list | Metrics applied to the next LLM span. |
| prompt | Prompt | Confident AI prompt object; captured on the LLM span for prompt-version analytics. |
| expected_output | str | Reference output for metrics that compare against ground truth. |
| expected_tools | list | Reference tool calls for tool-aware metrics. |
| context | list[str] | Ideal context the model should use when answering. |
| retrieval_context | list[str] | Retrieved context the model actually used (Faithfulness, Contextual Relevancy, etc.). |

with trace(...) accepts trace-level kwargs (name, tags, metadata, thread_id, user_id, metrics, input, output); see the tracing reference.
