LlamaIndex
LlamaIndex is an orchestration framework for data ingestion, indexing, and retrieval-augmented generation, with first-class agent and workflow primitives.
The deepeval integration registers a LlamaIndex event handler that turns every dispatch (workflow runs, agent steps, LLM chats, retrieval, and tool calls) into a span you can inspect, without rewriting your LlamaIndex app.
deepeval's LlamaIndex integration enables you to:
- Trace every workflow / agent run: each agent.run(...) produces a trace, and each LLM, tool, and retriever call becomes a component span.
- Evaluate traces or model / agent components with any deepeval metric through LlmSpanContext and AgentSpanContext.
- Run evals from scripts or CI/CD: same dispatcher, different surfaces.
- Compose with @observe and with trace(...) to evaluate larger flows that wrap one or more LlamaIndex runs.
Getting Started
Installation
pip install -U deepeval llama-index llama-index-llms-openai
The integration registers a BaseEventHandler and BaseSpanHandler against LlamaIndex's instrumentation dispatcher. After that, every workflow / agent run dispatches events that deepeval turns into spans.
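To make the mechanism concrete, here is a toy model of the handler-on-a-dispatcher pattern, written with only the standard library. It is not deepeval's or LlamaIndex's code: ToyDispatcher, SpanRecorder, Event, and Span are hypothetical stand-ins, and the real handlers likely track far more state. The sketch only shows how a handler registered once can turn start/end events into spans without the application code knowing about it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Event:
    """Hypothetical event: a span id, a kind ("llm", "tool", ...), and a phase."""
    span_id: str
    kind: str
    phase: str  # "start" or "end"
    payload: str = ""


@dataclass
class Span:
    """Hypothetical span built up from a start event and an end event."""
    span_id: str
    kind: str
    input: str = ""
    output: str = ""
    closed: bool = False


class ToyDispatcher:
    """Stand-in for an instrumentation dispatcher: fans events out to handlers."""

    def __init__(self) -> None:
        self._handlers: List[Callable[[Event], None]] = []

    def add_handler(self, handler: Callable[[Event], None]) -> None:
        self._handlers.append(handler)

    def dispatch(self, event: Event) -> None:
        for handler in self._handlers:
            handler(event)


class SpanRecorder:
    """Handler that opens a span on "start" and closes it on "end"."""

    def __init__(self) -> None:
        self.spans: Dict[str, Span] = {}

    def __call__(self, event: Event) -> None:
        if event.phase == "start":
            self.spans[event.span_id] = Span(event.span_id, event.kind, input=event.payload)
        elif event.phase == "end":
            span = self.spans[event.span_id]
            span.output = event.payload
            span.closed = True


def demo() -> List[Span]:
    dispatcher = ToyDispatcher()
    recorder = SpanRecorder()
    dispatcher.add_handler(recorder)  # registered once, at startup

    # The "application" below only dispatches events; it never touches the recorder.
    dispatcher.dispatch(Event("s1", "llm", "start", "What is 8 multiplied by 6?"))
    dispatcher.dispatch(Event("s2", "tool", "start", "multiply(8, 6)"))
    dispatcher.dispatch(Event("s2", "tool", "end", "48"))
    dispatcher.dispatch(Event("s1", "llm", "end", "8 multiplied by 6 is 48."))
    return list(recorder.spans.values())
```

Calling demo() returns one span per span id, each holding the input from its start event and the output from its end event.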
Instrument and evaluate
Call instrument_llama_index(get_dispatcher()) once at startup. Wrap each agent run in with trace(agent_span_context=AgentSpanContext(metrics=[...])) to evaluate the agent span.
import asyncio
from llama_index.llms.openai import OpenAI
from llama_index.core.agent import FunctionAgent
import llama_index.core.instrumentation as instrument
from deepeval.integrations.llama_index import instrument_llama_index
from deepeval.tracing import trace, AgentSpanContext
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.evaluate.configs import AsyncConfig
from deepeval.metrics import TaskCompletionMetric
instrument_llama_index(instrument.get_dispatcher())
def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b
agent = FunctionAgent(
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful calculator.",
)
async def run_agent(prompt: str):
    # Stage the metric so it is attached to the agent span created by this run.
    with trace(agent_span_context=AgentSpanContext(metrics=[TaskCompletionMetric()])):
        return await agent.run(prompt)
# Goldens are the inputs you want to evaluate.
dataset = EvaluationDataset(goldens=[Golden(input="What is 8 multiplied by 6?")])
for golden in dataset.evals_iterator(async_config=AsyncConfig(run_async=True)):
    task = asyncio.create_task(run_agent(golden.input))
    dataset.evaluate(task)
Done. You've run your first eval with full traceability into LlamaIndex via deepeval.
What gets traced
Each LlamaIndex Workflow or agent.run(...) call produces a trace, the end-to-end unit your user observes. Inside that trace are component spans for every dispatch LlamaIndex emits:
- Agent spans: FunctionAgent.run, Workflow.run, and nested agent steps.
- LLM spans: chat model calls (LLMChatStartEvent / LLMChatEndEvent).
- Tool spans: call_tool / acall_tool invocations.
- Retriever spans: retriever calls (RetrievalEndEvent) when your app uses retrieval.
Trace                        ← what the user observes
└── Agent: math_agent        ← one agent.run(...) call
    ├── LLM: gpt-4o-mini     ← component span: model decides
    ├── Tool: multiply       ← component span: tool input + output
    └── LLM: gpt-4o-mini     ← component span: final answer
The trace and its component spans are independently evaluable.
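The hierarchy can also be pictured as a small tree data structure. The sketch below is an illustration only, not deepeval's internal model: ToySpan, walk, and spans_of_kind are hypothetical names. It shows why spans can be evaluated independently, since selecting "every LLM span" is just a filter over the tree.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class ToySpan:
    """Hypothetical span node: a kind, a name, and child spans."""
    kind: str
    name: str
    children: List["ToySpan"] = field(default_factory=list)


def walk(span: ToySpan, depth: int = 0) -> Iterator[Tuple[int, ToySpan]]:
    """Yield (depth, span) pairs in depth-first order."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


def build_example_trace() -> ToySpan:
    # One agent span with three component spans: LLM, tool, LLM.
    return ToySpan("agent", "math_agent", [
        ToySpan("llm", "gpt-4o-mini"),
        ToySpan("tool", "multiply"),
        ToySpan("llm", "gpt-4o-mini"),
    ])


def spans_of_kind(root: ToySpan, kind: str) -> List[ToySpan]:
    """Select the component spans a metric could target, e.g. every LLM span."""
    return [span for _, span in walk(root) if span.kind == kind]
```

For example, spans_of_kind(build_example_trace(), "llm") returns the two model calls, while the root agent span stands in for the run as a whole.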
Running evals
There are two surfaces for running evals against a LlamaIndex app. Pick by where you want results to surface: your terminal during development, or your CI pipeline as a pass/fail gate.
In CI/CD (pytest)
Use the deepeval pytest integration. Each parametrized test invocation becomes one agent.run(...); failing metrics fail the test, which fails the build.
import asyncio
import pytest
from llama_index.llms.openai import OpenAI
from llama_index.core.agent import FunctionAgent
import llama_index.core.instrumentation as instrument
from deepeval import assert_test
from deepeval.integrations.llama_index import instrument_llama_index
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import TaskCompletionMetric
instrument_llama_index(instrument.get_dispatcher())
def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b
agent = FunctionAgent(
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful calculator.",
)
dataset = EvaluationDataset(goldens=[
    Golden(input="What is 8 multiplied by 6?"),
    Golden(input="What is 7 multiplied by 9?"),
])
@pytest.mark.parametrize("golden", dataset.goldens)
def test_llamaindex_agent(golden: Golden):
    asyncio.run(agent.run(golden.input))
    assert_test(golden=golden, metrics=[TaskCompletionMetric()])
Run it with:
deepeval test run test_llamaindex_agent.py
In a script
Use EvaluationDataset + evals_iterator(...). Each Golden becomes one agent run; metrics score the resulting trace.
import asyncio
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.evaluate.configs import AsyncConfig
from deepeval.metrics import TaskCompletionMetric
...
dataset = EvaluationDataset(goldens=[
Golden(input="What is 8 multiplied by 6?"),
Golden(input="What is 7 multiplied by 9?"),
])
for golden in dataset.evals_iterator(
    async_config=AsyncConfig(run_async=True),
    metrics=[TaskCompletionMetric()],
):
    task = asyncio.create_task(agent.run(golden.input))
    dataset.evaluate(task)
LlamaIndex's agent.run(...) is async-only, so evals_iterator here uses AsyncConfig(run_async=True) and dataset.evaluate(task) to run goldens concurrently.
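As a rough picture of what running goldens concurrently means, the sketch below schedules several runs as asyncio tasks and awaits them together. It uses plain asyncio.gather rather than deepeval's evals_iterator, and fake_agent_run is a hypothetical stand-in for agent.run(...) that sleeps instead of calling a model, so it runs without an API key.

```python
import asyncio
from typing import List


async def fake_agent_run(prompt: str) -> str:
    """Hypothetical stand-in for agent.run(prompt); sleeps instead of calling an LLM."""
    await asyncio.sleep(0.05)
    return f"answer to: {prompt}"


async def run_all(prompts: List[str]) -> List[str]:
    # Schedule every run as a task first, then await them together,
    # so the waits overlap instead of running back to back.
    tasks = [asyncio.create_task(fake_agent_run(p)) for p in prompts]
    return list(await asyncio.gather(*tasks))


def main() -> List[str]:
    prompts = ["What is 8 multiplied by 6?", "What is 7 multiplied by 9?"]
    return asyncio.run(run_all(prompts))
```

asyncio.gather returns results in the order the tasks were passed in, so each result can be matched back to its prompt even though the runs overlap.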
Applying metrics to components
The metrics=[...] you pass to evals_iterator evaluate the trace. To evaluate a component (a specific agent span or LLM call), stage the metric with AgentSpanContext or LlmSpanContext before the run.
Agent spans
Use AgentSpanContext(metrics=[...]) to score the agent span specifically. Useful when you want a metric on the agent step itself, distinct from the trace.
from deepeval.tracing import trace, AgentSpanContext
from deepeval.metrics import TaskCompletionMetric
...
async def run_agent(prompt: str):
    with trace(agent_span_context=AgentSpanContext(metrics=[TaskCompletionMetric()])):
        return await agent.run(prompt)
LLM calls
Use LlmSpanContext(metrics=[...]) to score the next LLM span LlamaIndex opens. Useful when you want to evaluate the model's reasoning step in isolation.
from deepeval.tracing import trace, LlmSpanContext
from deepeval.metrics import AnswerRelevancyMetric
...
async def run_agent(prompt: str):
    with trace(llm_span_context=LlmSpanContext(metrics=[AnswerRelevancyMetric()])):
        return await agent.run(prompt)
For deterministic tool calls, prefer update_current_span(...) to add metadata, inputs, and outputs instead of attaching metrics to the tool span.
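The idea of staging metrics for the next matching span can be sketched with a context variable. This is an assumption-laden toy, not how deepeval implements it: stage, open_span, and StagedContext are hypothetical names, and metrics are plain strings here. It only illustrates the semantics described in this section, where a staged context is consumed by the first span of the matching kind and ignored by spans of other kinds.

```python
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class StagedContext:
    """Hypothetical staged settings for the next span of a given kind."""
    kind: str
    metrics: List[str] = field(default_factory=list)


# Holds whatever is staged for the current run; None when nothing is staged.
_staged: ContextVar[Optional[StagedContext]] = ContextVar("_staged", default=None)


@contextmanager
def stage(kind: str, metrics: List[str]) -> Iterator[None]:
    """Stage metrics for the next span of `kind` opened inside the block."""
    token = _staged.set(StagedContext(kind, list(metrics)))
    try:
        yield
    finally:
        _staged.reset(token)  # restore whatever was staged before the block


def open_span(kind: str) -> List[str]:
    """Return the metrics attached to a new span, consuming a matching stage."""
    staged = _staged.get()
    if staged is not None and staged.kind == kind:
        _staged.set(None)  # read once: later spans of this kind get nothing
        return staged.metrics
    return []


def demo() -> List[List[str]]:
    with stage("llm", ["AnswerRelevancyMetric"]):
        # A tool span opens first and is skipped; the first LLM span consumes
        # the staged metrics; the second LLM span finds nothing staged.
        return [open_span("tool"), open_span("llm"), open_span("llm")]
```

In this toy, only the first LLM span inside the block receives the metric, which is one reason to stage a context immediately around the run you want scored.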
Customizing trace and span data
The integration captures inputs, outputs, model names, and tool calls automatically. For anything dynamic, the right API depends on where your code runs.
- Use with trace(...) for trace-level fields (name, tags, metadata, thread_id, user_id, metrics).
- Use LlmSpanContext and AgentSpanContext for component-level metric defaults and evaluation parameters.
- Use update_current_trace(...) and update_current_span(...) from inside a tool body to mutate fields the framework can't see.
from deepeval.tracing import update_current_span
def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    update_current_span(metadata={"deterministic": True})
    return a * b
Advanced patterns
The primitives above (instrument_llama_index, LlmSpanContext, AgentSpanContext, @observe, with trace(...)) compose around one boundary: LlamaIndex owns the dispatcher lifecycle, and your code stages metrics for the spans it produces.
Stage component metrics with span contexts
AgentSpanContext and LlmSpanContext stage metrics for the next matching component span. Use them when you want to evaluate a sub-step instead of the full trace.
from deepeval.tracing import trace, AgentSpanContext
from deepeval.metrics import TaskCompletionMetric
...
async def run_agent(prompt: str):
    with trace(agent_span_context=AgentSpanContext(metrics=[TaskCompletionMetric()])):
        return await agent.run(prompt)
No trace-level metrics required
Trace-level metrics are end-to-end metrics: they score the whole trace. They are not strictly necessary here because TaskCompletionMetric is attached to the agent span via AgentSpanContext, so CI/CD and scripts only need to run the agent.
This is how you'd run it:
import asyncio
import pytest
from deepeval import assert_test
...
@pytest.mark.parametrize("golden", dataset.goldens)
def test_agent_span(golden: Golden):
    asyncio.run(run_agent(golden.input))
    assert_test(golden=golden)
Run it with:
deepeval test run test_llamaindex_agent.py
Or, in a script:
import asyncio
...
for golden in dataset.evals_iterator(async_config=AsyncConfig(run_async=True)):
    task = asyncio.create_task(run_agent(golden.input))
    dataset.evaluate(task)
Wrap an agent run in @observe
When the agent run is part of a larger operation, decorate the outer function with @observe. The LlamaIndex spans nest under your observed span automatically.
from deepeval.tracing import observe
...
@observe(name="respond_to_user")
async def respond_to_user(prompt: str) -> str:
    result = await agent.run(prompt)
    return str(result)
Evaluate retrieval
When your LlamaIndex app uses a retriever, retrieval results are captured automatically on the retriever span. Stage LlmSpanContext with retrieval_context for any LLM that needs faithfulness-style metrics, or apply a metric directly to the retriever span via the dispatcher event.
from deepeval.tracing import trace, LlmSpanContext
from deepeval.metrics import FaithfulnessMetric
...
async def run_rag(prompt: str):
    with trace(llm_span_context=LlmSpanContext(metrics=[FaithfulnessMetric()])):
        return await query_engine.aquery(prompt)
API reference
AgentSpanContext(...) and LlmSpanContext(...) accept the following kwargs. Each is read once when the next matching span is created.
| Kwarg | Type | Description |
|---|---|---|
| metrics | list | Metrics applied to the next matching span (agent or LLM). |
| expected_output | str | Reference output for metrics that compare against ground truth. |
| expected_tools | list | Reference tool calls for tool-aware metrics. |
| context | list[str] | Ideal context the model should use when answering. |
| retrieval_context | list[str] | Retrieved context the model actually used (LLM-only; Faithfulness, Contextual Relevancy). |
| prompt | Prompt | Confident AI prompt object; LLM-only. |
with trace(...) accepts trace-level kwargs (name, tags, metadata, thread_id, user_id, metrics); see the tracing reference.