LlamaIndex

LlamaIndex is an orchestration framework that simplifies data ingestion, indexing, and querying, allowing developers to integrate private and public data into LLM applications for retrieval-augmented generation and knowledge augmentation.

tip

We recommend logging in to Confident AI to view your LlamaIndex evaluation traces.

deepeval login

End-to-End Evals

deepeval allows you to evaluate LlamaIndex applications end-to-end in under a minute.

Configure LlamaIndex

Create a FunctionAgent with the list of metrics you want to apply, then call the agent's run method with your input.

main.py
import asyncio

from llama_index.llms.openai import OpenAI
import llama_index.core.instrumentation as instrument

from deepeval.integrations.llama_index import instrument_llama_index, FunctionAgent

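from deepeval.metrics import AnswerRelevancyMetric

# Metric applied to every agent run
answer_relevance_metric = AnswerRelevancyMetric()

# Attach deepeval to LlamaIndex's dispatcher so agent runs are traced
instrument_llama_index(instrument.get_dispatcher())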

def multiply(a: float, b: float) -> float:
    """Useful for multiplying two numbers."""
    return a * b

agent = FunctionAgent(
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful assistant that can perform calculations.",
    metrics=[answer_relevance_metric]
)

async def llm_app(input: str):
    return await agent.run(input)

# asyncio.run(llm_app("What is 3 * 12?"))
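The agent only sees a tool through its name, signature, and docstring, so it can help to inspect those before wiring the function into FunctionAgent. The following is a minimal sketch using only the standard library; it does not call LlamaIndex or deepeval, and the variable names (tool_name, tool_description, tool_signature) are illustrative rather than part of either API.

```python
import inspect


def multiply(a: float, b: float) -> float:
    """Useful for multiplying two numbers."""
    return a * b


# Metadata an agent framework could derive from a plain function.
# These variable names are hypothetical, chosen for this sketch only.
tool_name = multiply.__name__
tool_description = inspect.getdoc(multiply)
tool_signature = str(inspect.signature(multiply))

# Sanity-check the tool itself, independent of any LLM.
product = multiply(3, 12)
```

A vague or missing docstring gives the model less to go on when deciding whether to call the tool, which in turn can affect the answers your metrics score.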

info

Evaluations are supported for the LlamaIndex FunctionAgent, ReActAgent, and CodeActAgent. Only metrics whose required LLM parameters are limited to input and output are eligible for evaluation.
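One way to read the eligibility rule is as a subset check: a metric qualifies only if everything it requires is something the agent run provides. The sketch below models that idea with a hypothetical MetricSpec class. It is not how deepeval implements the check, and the parameter set listed for FaithfulnessMetric is an assumption made for illustration.

```python
from dataclasses import dataclass

# Parameters an end-to-end agent run can supply, per the note above.
AVAILABLE_PARAMS = frozenset({"input", "output"})


@dataclass(frozen=True)
class MetricSpec:
    """Hypothetical description of a metric; not a deepeval class."""
    name: str
    required_params: frozenset


def is_eligible(spec: MetricSpec) -> bool:
    # Eligible only if every required parameter is available.
    return spec.required_params <= AVAILABLE_PARAMS


specs = [
    MetricSpec("AnswerRelevancyMetric", frozenset({"input", "output"})),
    # Assumed to also need retrieval context, which the agent run lacks.
    MetricSpec(
        "FaithfulnessMetric",
        frozenset({"input", "output", "retrieval_context"}),
    ),
]

eligible = [spec.name for spec in specs if is_eligible(spec)]
```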

Run evaluations

Create an EvaluationDataset and invoke your LlamaIndex application for each golden within the evals_iterator() loop to run end-to-end evaluations.

Asynchronous

main.py
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[
    Golden(input="What is 3 * 12?"),
    Golden(input="What is 4 * 13?")
])

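# Continues main.py above, which already imports asyncio and defines llm_app
for golden in dataset.evals_iterator():
    # Schedule one agent run per golden and hand the task to deepeval
    task = asyncio.create_task(llm_app(golden.input))
    dataset.evaluate(task)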

✅ Done. The evals_iterator will automatically generate a test run with individual evaluation traces for each golden.
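To see the one-task-per-golden pattern without spending LLM calls, you can try it against a stub. This is a minimal sketch using plain asyncio; fake_llm_app and run_all are hypothetical stand-ins, and the sketch does not reproduce what evals_iterator does internally.

```python
import asyncio


async def fake_llm_app(prompt: str) -> str:
    """Hypothetical stand-in for llm_app; no LLM call is made."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    # Pull the two integers out of prompts shaped like "What is 3 * 12?"
    a, b = [int(tok) for tok in prompt.rstrip("?").split() if tok.isdigit()]
    return str(a * b)


async def run_all(inputs: list[str]) -> list[str]:
    # One task per input, mirroring one task per golden.
    tasks = [asyncio.create_task(fake_llm_app(text)) for text in inputs]
    # gather returns results in the same order the tasks were passed in.
    return await asyncio.gather(*tasks)


results = asyncio.run(run_all(["What is 3 * 12?", "What is 4 * 13?"]))
```

Because the tasks are scheduled before any of them is awaited, the runs can overlap instead of executing strictly one after another.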

View on Confident AI (optional)

note

If you need to evaluate individual components of your LlamaIndex application, set up tracing instead.

Evals in Production

To run online evaluations in production, replace the metrics argument of FunctionAgent with metric_collection, set to the name of a metric collection on Confident AI, then deploy your LlamaIndex agent.

...

# Create your agent with the metric collection name
agent = FunctionAgent(
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful assistant that can perform calculations.",
    # metrics=[answer_relevance_metric],
    metric_collection="test_collection_1"
)

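# agent.run is awaitable, so drive it from an event loop
async def main():
    return await agent.run("What is 3 * 12?")

asyncio.run(main())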
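If the same module serves both local evaluation and production, the choice between metrics and metric_collection can be made in one place. The helper below is a sketch under stated assumptions: evaluation_kwargs and the APP_ENV environment variable are hypothetical names, and only the metrics and metric_collection keyword names come from the examples on this page.

```python
import os


def evaluation_kwargs(environment: str, local_metrics: list, collection_name: str) -> dict:
    """Hypothetical helper: pick the evaluation arguments for FunctionAgent."""
    if environment == "production":
        # Online evals: reference the collection by name.
        return {"metric_collection": collection_name}
    # Local end-to-end evals: pass metric instances directly.
    return {"metrics": list(local_metrics)}


# APP_ENV is an assumed variable name; use whatever your deployment sets.
environment = os.environ.get("APP_ENV", "development")
kwargs = evaluation_kwargs(environment, [], "test_collection_1")
```

The resulting dictionary could then be unpacked into the constructor, for example FunctionAgent(tools=[multiply], llm=..., **kwargs), so that only one of the two arguments is ever supplied.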
