LlamaIndex

LlamaIndex is an orchestration framework for data ingestion, indexing, and retrieval-augmented generation, with first-class agent and workflow primitives.

The deepeval integration registers a LlamaIndex event handler that turns every dispatch (workflow runs, agent steps, LLM chats, retrieval, and tool calls) into a span you can inspect, without rewriting your LlamaIndex app.

deepeval's LlamaIndex integration enables you to:

  • Trace every workflow / agent run: each agent.run(...) produces a trace, and each LLM, tool, and retriever call becomes a component span.
  • Evaluate traces or model / agent components with any deepeval metric through LlmSpanContext and AgentSpanContext.
  • Run evals from scripts or CI/CD: same dispatcher, different surfaces.
  • Compose with @observe and with trace(...) to evaluate larger flows that wrap one or more LlamaIndex runs.

Getting Started

Installation

pip install -U deepeval llama-index llama-index-llms-openai

The integration registers a BaseEventHandler and BaseSpanHandler against LlamaIndex's instrumentation dispatcher. After that, every workflow / agent run dispatches events that deepeval turns into spans.
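
To make the handler pattern concrete, here is a minimal, self-contained sketch of a dispatcher that fans events out to a registered handler, which pairs start and end events into spans. It uses neither LlamaIndex nor deepeval: ToyDispatcher, SpanRecorder, Span, and the "start"/"end" phases are hypothetical stand-ins chosen for illustration, and the real handlers presumably track much more state.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Span:
    """One recorded unit of work (hypothetical shape, not deepeval's)."""
    name: str
    kind: str
    start: float
    end: Optional[float] = None


class ToyDispatcher:
    """Stand-in for an instrumentation dispatcher: fans events out to handlers."""

    def __init__(self) -> None:
        self._handlers: List[Callable[[str, str, str], None]] = []

    def add_handler(self, handler: Callable[[str, str, str], None]) -> None:
        self._handlers.append(handler)

    def emit(self, phase: str, kind: str, name: str) -> None:
        for handler in self._handlers:
            handler(phase, kind, name)


class SpanRecorder:
    """Pairs "start" and "end" events into finished spans."""

    def __init__(self) -> None:
        self.open: Dict[str, Span] = {}
        self.finished: List[Span] = []

    def __call__(self, phase: str, kind: str, name: str) -> None:
        if phase == "start":
            self.open[name] = Span(name, kind, time.perf_counter())
        elif phase == "end":
            span = self.open.pop(name)
            span.end = time.perf_counter()
            self.finished.append(span)


dispatcher = ToyDispatcher()
recorder = SpanRecorder()
dispatcher.add_handler(recorder)  # registered once, at startup

# The "app" only emits events; it never touches the recorder directly.
dispatcher.emit("start", "llm", "gpt-4o-mini")
dispatcher.emit("end", "llm", "gpt-4o-mini")
dispatcher.emit("start", "tool", "multiply")
dispatcher.emit("end", "tool", "multiply")

print([(span.kind, span.name) for span in recorder.finished])
```

The point of the design is that the application code stays unchanged: registering the handler once is enough for every later event to be recorded.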

Instrument and evaluate

Call instrument_llama_index(get_dispatcher()) once at startup. Then wrap each agent run in a with trace(agent_span_context=AgentSpanContext(metrics=[...])) block to evaluate the agent span.

llamaindex_agent.py
import asyncio

from llama_index.llms.openai import OpenAI
from llama_index.core.agent import FunctionAgent
import llama_index.core.instrumentation as instrument

from deepeval.integrations.llama_index import instrument_llama_index
from deepeval.tracing import trace, AgentSpanContext
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.evaluate.configs import AsyncConfig
from deepeval.metrics import TaskCompletionMetric

instrument_llama_index(instrument.get_dispatcher())

def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

agent = FunctionAgent(
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful calculator.",
)

async def run_agent(prompt: str):
    with trace(agent_span_context=AgentSpanContext(metrics=[TaskCompletionMetric()])):
        return await agent.run(prompt)

# Goldens are the inputs you want to evaluate.
dataset = EvaluationDataset(goldens=[Golden(input="What is 8 multiplied by 6?")])

for golden in dataset.evals_iterator(async_config=AsyncConfig(run_async=True)):
    task = asyncio.create_task(run_agent(golden.input))
    dataset.evaluate(task)

Done ✅. You've run your first eval with full traceability into LlamaIndex via deepeval.

What gets traced

Each LlamaIndex Workflow or agent.run(...) call produces a trace, the end-to-end unit your user observes. Inside that trace are component spans for every dispatch LlamaIndex emits:

  • Agent spans: FunctionAgent.run, Workflow.run, and nested agent steps.
  • LLM spans: chat model calls (LLMChatStartEvent / LLMChatEndEvent).
  • Tool spans: call_tool / acall_tool invocations.
  • Retriever spans: retriever calls (RetrievalEndEvent) when your app uses retrieval.

Trace                          ← what the user observes
└── Agent: math_agent          ← one agent.run(...) call
    ├── LLM: gpt-4o-mini       ← component span: model decides
    ├── Tool: multiply         ← component span: tool input + output
    └── LLM: gpt-4o-mini       ← component span: final answer

The trace and its component spans are independently evaluable.
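
As a rough mental model, the tree above can be written down as plain data and filtered by span kind, which is what makes each level separately addressable. This sketch is illustrative only: Node and walk are hypothetical names, not deepeval types.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    kind: str  # "trace", "agent", "llm", "tool", or "retriever"
    name: str
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


# The same shape as the diagram: one trace, one agent span, three component spans.
trace_tree = Node("trace", "trace", [
    Node("agent", "math_agent", [
        Node("llm", "gpt-4o-mini"),
        Node("tool", "multiply"),
        Node("llm", "gpt-4o-mini"),
    ]),
])

kinds = [node.kind for node in walk(trace_tree)]
llm_spans = [node for node in walk(trace_tree) if node.kind == "llm"]
print(kinds)
```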

Running evals

There are two surfaces for running evals against a LlamaIndex app. Pick by where you want results to surface: your terminal during development, or your CI pipeline as a pass/fail gate.

In CI/CD (pytest)

Use the deepeval pytest integration. Each parametrized test invocation becomes one agent.run(...); failing metrics fail the test, which fails the build.

test_llamaindex_agent.py
import asyncio
import pytest

from llama_index.llms.openai import OpenAI
from llama_index.core.agent import FunctionAgent
import llama_index.core.instrumentation as instrument

from deepeval import assert_test
from deepeval.integrations.llama_index import instrument_llama_index
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric

instrument_llama_index(instrument.get_dispatcher())

def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

agent = FunctionAgent(
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful calculator.",
)
dataset = EvaluationDataset(goldens=[
    Golden(input="What is 8 multiplied by 6?"),
    Golden(input="What is 7 multiplied by 9?"),
])

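async def run_agent(prompt: str):
    # agent.run(...) may return an awaitable handler rather than a coroutine;
    # awaiting it inside a coroutine gives asyncio.run something it can drive.
    return await agent.run(prompt)

@pytest.mark.parametrize("golden", dataset.goldens)
def test_llamaindex_agent(golden: Golden):
    asyncio.run(run_agent(golden.input))
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])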

Run it with:

deepeval test run test_llamaindex_agent.py

In a script

Use EvaluationDataset + evals_iterator(...). Each Golden becomes one agent run; metrics score the resulting trace.

llamaindex_agent.py
import asyncio

from deepeval.dataset import EvaluationDataset, Golden
from deepeval.evaluate.configs import AsyncConfig
from deepeval.metrics import TaskCompletionMetric
...

dataset = EvaluationDataset(goldens=[
    Golden(input="What is 8 multiplied by 6?"),
    Golden(input="What is 7 multiplied by 9?"),
])

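async def run_agent(prompt: str):
    # Awaiting inside a coroutine gives asyncio.create_task a coroutine to schedule,
    # whether agent.run(...) returns a coroutine or an awaitable handler.
    return await agent.run(prompt)

for golden in dataset.evals_iterator(
    async_config=AsyncConfig(run_async=True),
    metrics=[TaskCompletionMetric()],
):
    task = asyncio.create_task(run_agent(golden.input))
    dataset.evaluate(task)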

LlamaIndex's agent.run(...) is async-only, so evals_iterator here uses AsyncConfig(run_async=True) and dataset.evaluate(task) to run goldens concurrently.
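
The concurrency idea itself is plain asyncio. The sketch below shows it without deepeval or LlamaIndex: fake_agent_run is a hypothetical stand-in for an async-only agent call, and run_all schedules one task per prompt so the calls overlap instead of running back to back.

```python
import asyncio
from typing import List


async def fake_agent_run(prompt: str) -> str:
    """Hypothetical stand-in for an async-only agent call."""
    await asyncio.sleep(0.01)  # simulate model latency
    return f"answer to: {prompt}"


async def run_all(prompts: List[str]) -> List[str]:
    # create_task schedules each coroutine right away, so the calls overlap.
    tasks = [asyncio.create_task(fake_agent_run(p)) for p in prompts]
    # gather returns results in the order the tasks were passed in.
    return await asyncio.gather(*tasks)


results = asyncio.run(run_all([
    "What is 8 multiplied by 6?",
    "What is 7 multiplied by 9?",
]))
print(results)
```

Note that asyncio.create_task needs a running event loop, which is why the tasks are created inside a coroutine here.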

Applying metrics to components

The metrics=[...] argument you pass to evals_iterator evaluates the trace. To evaluate a component (a specific agent span or LLM call), stage the metric with AgentSpanContext or LlmSpanContext before the run.

Agent spans

Use AgentSpanContext(metrics=[...]) to score the agent span specifically. Useful when you want a metric on the agent step itself, distinct from the trace.

llamaindex_agent.py
from deepeval.tracing import trace, AgentSpanContext
from deepeval.metrics import TaskCompletionMetric
...

async def run_agent(prompt: str):
    with trace(agent_span_context=AgentSpanContext(metrics=[TaskCompletionMetric()])):
        return await agent.run(prompt)

LLM calls

Use LlmSpanContext(metrics=[...]) to score the next LLM span LlamaIndex opens. Useful when you want to evaluate the model's reasoning step in isolation.

llamaindex_agent.py
from deepeval.tracing import trace, LlmSpanContext
from deepeval.metrics import AnswerRelevancyMetric
...

async def run_agent(prompt: str):
    with trace(llm_span_context=LlmSpanContext(metrics=[AnswerRelevancyMetric()])):
        return await agent.run(prompt)

For deterministic tool calls, prefer update_current_span(...) to add metadata, inputs, and outputs instead of attaching metrics to the tool span.

Customizing trace and span data

The integration captures inputs, outputs, model names, and tool calls automatically. For anything dynamic, the right API depends on where your code runs.

  • Use with trace(...) for trace-level fields (name, tags, metadata, thread_id, user_id, metrics).
  • Use LlmSpanContext and AgentSpanContext for component-level metric defaults and evaluation parameters.
  • Use update_current_trace(...) and update_current_span(...) from inside a tool body to mutate fields the framework can't see.

llamaindex_agent.py
from deepeval.tracing import update_current_span

def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    update_current_span(metadata={"deterministic": True})
    return a * b

Advanced patterns

The primitives above (instrument_llama_index, LlmSpanContext, AgentSpanContext, @observe, and with trace(...)) compose around one boundary: LlamaIndex owns the dispatcher lifecycle, and your code stages metrics for the spans it produces.

Stage component metrics with span contexts

AgentSpanContext and LlmSpanContext stage metrics for the next matching component span. Use them when you want to evaluate a sub-step instead of the full trace.

llamaindex_agent.py
from deepeval.tracing import trace, AgentSpanContext
from deepeval.metrics import TaskCompletionMetric
...

async def run_agent(prompt: str):
    with trace(agent_span_context=AgentSpanContext(metrics=[TaskCompletionMetric()])):
        return await agent.run(prompt)

No trace-level metrics required

Trace-level metrics are end-to-end metrics: they score the whole trace. They are not strictly necessary here because TaskCompletionMetric is attached to the agent span via AgentSpanContext, so CI/CD and scripts only need to run the agent.

This is how you'd run it:

test_llamaindex_agent.py
import asyncio
import pytest
from deepeval import assert_test
...

@pytest.mark.parametrize("golden", dataset.goldens)
def test_agent_span(golden: Golden):
    asyncio.run(run_agent(golden.input))
    assert_test(golden=golden)

Run it with:

deepeval test run test_llamaindex_agent.py

Or from a script:

llamaindex_agent.py
import asyncio
...

for golden in dataset.evals_iterator(async_config=AsyncConfig(run_async=True)):
    task = asyncio.create_task(run_agent(golden.input))
    dataset.evaluate(task)

Wrap an agent run in @observe

When the agent run is part of a larger operation, decorate the outer function with @observe. The LlamaIndex spans nest under your observed span automatically.

llamaindex_agent.py
from deepeval.tracing import observe
...

@observe(name="respond_to_user")
async def respond_to_user(prompt: str) -> str:
    result = await agent.run(prompt)
    return str(result)

Evaluate retrieval

When your LlamaIndex app uses a retriever, retrieval results are captured automatically on the retriever span. Stage LlmSpanContext with retrieval_context for any LLM that needs faithfulness-style metrics, or apply a metric directly to the retriever span via the dispatcher event.

llamaindex_agent.py
from deepeval.tracing import trace, LlmSpanContext
from deepeval.metrics import FaithfulnessMetric
...

async def run_rag(prompt: str):
    with trace(llm_span_context=LlmSpanContext(metrics=[FaithfulnessMetric()])):
        return await query_engine.aquery(prompt)

API reference

AgentSpanContext(...) and LlmSpanContext(...) accept the following kwargs. Each is read once when the next matching span is created.

Kwarg              Type       Description
metrics            list       Metrics applied to the next matching span (agent or LLM).
expected_output    str        Reference output for metrics that compare against ground truth.
expected_tools     list       Reference tool calls for tool-aware metrics.
context            list[str]  Ideal context the model should use when answering.
retrieval_context  list[str]  Retrieved context the model actually used (LLM-only; Faithfulness, Contextual Relevancy).
prompt             Prompt     Confident AI prompt object; LLM-only.

with trace(...) accepts trace-level kwargs (name, tags, metadata, thread_id, user_id, metrics); see the tracing reference.
