LlamaIndex
LlamaIndex is an orchestration framework that simplifies data ingestion, indexing, and querying, allowing developers to integrate private and public data into LLM applications for retrieval-augmented generation and knowledge augmentation.
We recommend logging in to Confident AI to view your LlamaIndex evaluation traces.
```bash
deepeval login
```
End-to-End Evals
`deepeval` allows you to evaluate LlamaIndex applications end-to-end in under a minute.
Configure LlamaIndex
Create a `FunctionAgent` with a list of metrics you wish to use, and invoke it via your LlamaIndex application's `run` method.
```python
import asyncio

import llama_index.core.instrumentation as instrument
from llama_index.llms.openai import OpenAI

from deepeval.integrations.llama_index import instrument_llama_index, FunctionAgent
from deepeval.metrics import AnswerRelevancyMetric

answer_relevance_metric = AnswerRelevancyMetric()

# Route LlamaIndex's instrumentation events to deepeval
instrument_llama_index(instrument.get_dispatcher())


def multiply(a: float, b: float) -> float:
    """Useful for multiplying two numbers."""
    return a * b


agent = FunctionAgent(
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful assistant that can perform calculations.",
    metrics=[answer_relevance_metric],
)


async def llm_app(input: str):
    return await agent.run(input)


# asyncio.run(llm_app("What is 3 * 12?"))
```
Evaluations are supported for LlamaIndex `FunctionAgent`, `ReActAgent`, and `CodeActAgent`. Only metrics with the LLM test case parameters `input` and `output` are eligible for evaluation.
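For example, the same metric can be attached to a `ReActAgent`. This is a minimal sketch that assumes deepeval's `ReActAgent` wrapper accepts the same `tools`, `llm`, and `metrics` arguments as `FunctionAgent`:

```python
# A sketch, assuming deepeval wraps ReActAgent the same way as FunctionAgent
from deepeval.integrations.llama_index import ReActAgent
from deepeval.metrics import AnswerRelevancyMetric

react_agent = ReActAgent(
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    # AnswerRelevancyMetric only needs the run's input and output, so it is
    # eligible here; metrics that also require retrieval_context are not
    metrics=[AnswerRelevancyMetric()],
)
```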
Run evaluations
Create an `EvaluationDataset` and invoke your LlamaIndex application asynchronously for each golden within the `evals_iterator()` loop to run end-to-end evaluations.
```python
from deepeval.dataset import EvaluationDataset, Golden

dataset = EvaluationDataset(goldens=[
    Golden(input="What is 3 * 12?"),
    Golden(input="What is 4 * 13?"),
])

for golden in dataset.evals_iterator():
    # Schedule the app call as a task so deepeval can await and score it
    task = asyncio.create_task(llm_app(golden.input))
    dataset.evaluate(task)
```
✅ Done. The `evals_iterator` will automatically generate a test run with individual evaluation traces for each golden, which you can optionally view on Confident AI.
If you need to evaluate individual components of your LlamaIndex application, set up tracing instead.
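As a starting point, here is a minimal sketch of component-level tracing with deepeval's `@observe` decorator; the `answer_question` component below is hypothetical and stands in for one step of your LlamaIndex app:

```python
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric


@observe(metrics=[AnswerRelevancyMetric()])
def answer_question(query: str) -> str:
    # Hypothetical component: replace with the LlamaIndex step you want to evaluate
    answer = f"The answer to '{query}' is 42."
    # Attach a test case so the metric can score this span's input and output
    update_current_span(test_case=LLMTestCase(input=query, actual_output=answer))
    return answer
```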
Evals in Production
To run online evaluations in production, simply replace `metrics` in `FunctionAgent` with a metric collection string from Confident AI, and push your LlamaIndex agent to production.
```python
...

# Invoke your agent with the metric collection name
agent = FunctionAgent(
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful assistant that can perform calculations.",
    # metrics=[answer_relevance_metric],
    metric_collection="test_collection_1",
)


async def main():
    # agent.run() is asynchronous and must be awaited inside an event loop
    return await agent.run("What is 3 * 12?")


asyncio.run(main())
```
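Note that `test_collection_1` above is a placeholder: the string should match the name of a metric collection you have already created on Confident AI, which can then score the traces your agent produces in production.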