Improving Your RAG Using Evals
In this section, we are going to iterate on multiple hyperparameters for our RAG agent and use deepeval's evaluations to see which configuration performs best.
Compared to most LLM applications, Retrieval-Augmented Generation (RAG) applications have a particularly large set of tunable hyperparameters that can significantly improve the agent's performance. Some of these hyperparameters are listed below (a small configuration sketch follows the list):
- Vector store (The vector database used to store our knowledge base)
- Embedding model (The model which is used to convert data to numerical representations)
- Chunk size (The length of each text piece when splitting documents)
- Chunk overlap (The number of words shared between chunks to keep context)
- Generator model (The model that creates answers using the retrieved information)
- k size (The number of documents retrieved)
- Prompt template (The prompt used to generate responses from the generator model)
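To make the tuning surface concrete, here's a small sketch of how these hyperparameters could be grouped into a single configuration object. The `RAGConfig` dataclass and its defaults are hypothetical and purely illustrative; they are not part of deepeval or the agent built below.

```python
from dataclasses import dataclass

@dataclass
class RAGConfig:
    # Hypothetical container for the hyperparameters we iterate on in this section
    vector_store: str = "FAISS"                 # vector database for the knowledge base
    embedding_model: str = "OpenAIEmbeddings"   # model that converts text to vectors
    chunk_size: int = 1024                      # length of each text chunk
    chunk_overlap: int = 50                     # amount of text shared between chunks
    generator_model: str = "gpt-4"              # model that writes the final answer
    k: int = 2                                  # number of documents retrieved per query
    prompt_template: str = ""                   # prompt used by the generator
```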
Pulling Datasets
In the previous section, we saw how to create datasets and store them in the cloud. We can now pull that dataset and reuse it as many times as we need to generate test cases and evaluate our RAG agent.
Here's how we can pull datasets from the cloud:
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset()
dataset.pull(alias="QA Agent Dataset")
The pulled dataset contains goldens, which can be used to create test cases at runtime and run evals. Here's an example of how to create test cases from the pulled dataset:
from deepeval.test_case import LLMTestCase
from qa_agent import RAGAgent  # import your RAG QA Agent here

# Evaluate for each golden
document_path = ["theranos_legacy.txt"]
retriever = RAGAgent(document_path)

retriever_test_cases = []
generator_test_cases = []

for golden in dataset.goldens:
    retrieved_docs = retriever.retrieve(golden.input)
    generated_answer = retriever.generate(golden.input, retrieved_docs)
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=str(generated_answer),
        expected_output=golden.expected_output,
        retrieval_context=retrieved_docs
    )
    generator_test_cases.append(test_case)
    retriever_test_cases.append(test_case)

print(len(retriever_test_cases))
print(len(generator_test_cases))
You can use these test cases to evaluate your RAG agent anywhere and anytime. Make sure you've already created a dataset on Confident AI for this to work. Click here to learn more about datasets.
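With the test cases in hand, running an evaluation is a single call to deepeval's `evaluate` function. The metric choices below are only an illustration of how the two lists could be evaluated separately; any retrieval or generation metrics can be passed instead.

```python
from deepeval import evaluate
from deepeval.metrics import (
    ContextualRelevancyMetric,  # retriever: is the retrieved context relevant to the input?
    AnswerRelevancyMetric,      # generator: does the answer address the input?
)

# Evaluate retriever and generator test cases separately
evaluate(retriever_test_cases, metrics=[ContextualRelevancyMetric()])
evaluate(generator_test_cases, metrics=[AnswerRelevancyMetric()])
```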
Iterating on Hyperparameters
Now that we have our dataset, we can use it to generate test cases from our RAG agent under different configurations and evaluate them to find the hyperparameters that work best for our use case. Here's how to run iterative evals on the different components of our RAG agent.
In the previous stages, we evaluated the retriever and the generator separately. We will use the same approach here, iterating and running evaluations separately for each component.
Retriever Iteration
We will iterate on different retriever hyperparameters like chunk size, embedding model, and vector store. Here's how we can do that:
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import (
    ContextualRelevancyMetric,
    ContextualRecallMetric,
    ContextualPrecisionMetric,
)
from qa_agent import RAGAgent
from langchain.embeddings import OpenAIEmbeddings, HuggingFaceEmbeddings
from langchain.vectorstores import Chroma, FAISS

dataset = EvaluationDataset()
dataset.pull(alias="QA Agent Dataset")

metrics = [...]  # Use the same metrics used before

chunking_strategies = [500, 1024, 2048]
embedding_models = [
    ("OpenAIEmbeddings", OpenAIEmbeddings()),
    ("HuggingFaceEmbeddings", HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")),
]
vector_store_classes = [
    ("FAISS", FAISS),
    ("Chroma", Chroma)
]

document_paths = ["theranos_legacy.txt"]

for chunk_size in chunking_strategies:
    for embedding_name, embedding_model in embedding_models:
        for vector_store_name, vector_store_class in vector_store_classes:
            # Initialize retriever with the new configuration
            retriever = RAGAgent(
                document_paths,
                embedding_model=embedding_model,
                chunk_size=chunk_size,
                vector_store_class=vector_store_class,
            )

            retriever_test_cases = []
            for golden in dataset.goldens:
                retrieved_docs = retriever.retrieve(golden.input)
                context_list = [doc.page_content for doc in retrieved_docs]
                test_case = LLMTestCase(
                    input=golden.input,
                    actual_output=golden.expected_output,  # placeholder; only retrieval is evaluated here
                    expected_output=golden.expected_output,
                    retrieval_context=context_list
                )
                retriever_test_cases.append(test_case)

            evaluate(
                retriever_test_cases,
                metrics,
                hyperparameters={
                    "chunk_size": chunk_size,
                    "embedding_model": embedding_name,
                    "vector_store": vector_store_name
                }
            )
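The `metrics = [...]` placeholder above stands for the same retrieval metrics used in the earlier retriever evaluation. A minimal sketch of what that list might look like, assuming default thresholds:

```python
from deepeval.metrics import (
    ContextualRelevancyMetric,
    ContextualRecallMetric,
    ContextualPrecisionMetric,
)

# Retrieval metrics scored against each test case's retrieval_context
metrics = [
    ContextualRelevancyMetric(),
    ContextualRecallMetric(),
    ContextualPrecisionMetric(),
]
```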
After running these iterations, we observed that the following configuration scored the highest:
- Chunk Size: 1024
- Embedding Model: OpenAIEmbeddings
- Vector Store: Chroma
These were the average results:
| Metric | Score |
|---|---|
| Contextual Relevancy | 0.8 |
| Contextual Recall | 0.9 |
| Contextual Precision | 0.8 |
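Putting these results together, here's a sketch of what initializing the retriever with the best configuration might look like; this is also what the `RAGAgent(...)` placeholder in the generator loop below stands for.

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from qa_agent import RAGAgent

# Best retriever configuration found during the iterations above
best_retriever = RAGAgent(
    ["theranos_legacy.txt"],
    embedding_model=OpenAIEmbeddings(),
    chunk_size=1024,
    vector_store_class=Chroma,
)
```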
Generator Iteration
We will iterate on different generator models and an improved prompt template.
This is the prompt template we previously used:
You are a helpful assistant. Use the context below to answer the user's query.
Format your response strictly as a JSON object with the following structure:
{
    "answer": "<a concise, complete answer to the user's query>",
    "citations": [
        "<relevant quoted snippet or summary from source 1>",
        "<relevant quoted snippet or summary from source 2>",
        ...
    ]
}
Only include information that appears in the provided context. Do not make anything up.
Only respond in JSON — No explanations needed. Only use information from the context. If
nothing relevant is found, respond with:
{
    "answer": "No relevant information available.",
    "citations": []
}
Context:
{context}
Query:
{query}
We will now use the following updated prompt template:
You are a highly accurate and concise assistant. Your task is to extract and synthesize information strictly from the provided context to answer the user's query.
Respond **only** in the following JSON format:
{
    "answer": "<a clear, complete, and concise answer to the user's query, based strictly on the context>",
    "citations": [
        "<direct quote or summarized excerpt from source 1 that supports the answer>",
        "<direct quote or summarized excerpt from source 2 that supports the answer>",
        ...
    ]
}
Instructions:
- Use only the provided context to form your response. Do not include outside knowledge or assumptions.
- All parts of your answer must be explicitly supported by the context.
- If no relevant information is found, return this exact JSON:
{
    "answer": "No relevant information available.",
    "citations": []
}
Input format:
Context:
{context}
Query:
{query}
This updated prompt template is more explicit and detailed than the first one. Now let's run iterations on our generator using the new prompt template.
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval
from langchain.llms import Ollama, OpenAI, HuggingFaceHub
from qa_agent import RAGAgent

dataset = EvaluationDataset()
dataset.pull(alias="QA Agent Dataset")

metrics = [...]  # Use the same metrics as before
prompt_template = "..."  # Use your new prompt template here

models = [
    ("ollama", Ollama(model="llama3")),
    ("openai", OpenAI(model_name="gpt-4")),
    ("huggingface", HuggingFaceHub(repo_id="google/flan-t5-large")),
]

for model_name, model in models:
    retriever = RAGAgent(...)  # Initialize retriever with the best config found above

    generator_test_cases = []
    for golden in dataset.goldens:
        answer, retrieved_docs = retriever.answer(golden.input, prompt_template, model)
        context_list = [doc.page_content for doc in retrieved_docs]
        test_case = LLMTestCase(
            input=golden.input,
            actual_output=str(answer),
            retrieval_context=context_list
        )
        generator_test_cases.append(test_case)

    evaluate(
        generator_test_cases,
        metrics,
        hyperparameters={
            "model_name": model_name,
        }
    )
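The `metrics = [...]` placeholder here refers to the same custom GEval metrics used in the earlier generator evaluation. A rough sketch, assuming the Answer Correctness and Citation Accuracy metrics reported in the results below (the criteria strings are illustrative, not the exact ones used before):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Illustrative GEval metrics matching the results table below
answer_correctness = GEval(
    name="Answer Correctness",
    criteria="Evaluate whether the 'answer' field correctly and completely answers the input, using only the retrieval context.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
)
citation_accuracy = GEval(
    name="Citation Accuracy",
    criteria="Check whether the 'citations' field quotes or accurately summarizes snippets that actually appear in the retrieval context.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
)

metrics = [answer_correctness, citation_accuracy]
```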
After running the iterations, gpt-4 scored the highest. These were the average results:
| Metric | Score |
|---|---|
| Answer Correctness | 0.8 |
| Citation Accuracy | 0.9 |
RAG Agent Improvement
Here's how we changed the RAGAgent class to support the new configurations that improved the agent's performance:
from langchain.vectorstores import FAISS, Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
import tempfile
from deepeval.tracing import observe

class RAGAgent:
    def __init__(
        self,
        document_paths: list,
        embedding_model=None,
        chunk_size: int = 1024,
        chunk_overlap: int = 50,
        vector_store_class=FAISS,  # Added vector_store_class so Chroma can be used
        k: int = 2
    ):
        self.document_paths = document_paths
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.embedding_model = embedding_model or OpenAIEmbeddings()
        self.vector_store_class = vector_store_class
        self.k = k
        self.persist_directory = tempfile.mkdtemp()  # must exist before the vector store is built
        self.vector_store = self._load_vector_store()

    def _load_vector_store(self):
        documents = []
        for document_path in self.document_paths:
            with open(document_path, "r", encoding="utf-8") as file:
                raw_text = file.read()
            splitter = RecursiveCharacterTextSplitter(
                chunk_size=self.chunk_size,
                chunk_overlap=self.chunk_overlap
            )
            documents.extend(splitter.create_documents([raw_text]))

        kwargs = {}
        if self.vector_store_class is Chroma:
            # Only Chroma accepts a persist_directory; FAISS does not
            kwargs["persist_directory"] = self.persist_directory
        return self.vector_store_class.from_documents(
            documents, self.embedding_model, **kwargs
        )

    @observe()
    def retrieve(self, query: str):
        docs = self.vector_store.similarity_search(query, k=self.k)
        context = [doc.page_content for doc in docs]
        return context

    @observe()
    def generate(
        self,
        query: str,
        retrieved_docs: list,
        llm_model=None,
        prompt_template: str = None
    ):  # Changed prompt template and model used
        context = "\n".join(retrieved_docs)
        model = llm_model or OpenAI(model_name="gpt-4")
        prompt = prompt_template or (
            "You are an AI assistant designed for factual retrieval. Using the context below, extract only the information needed to answer the user's query. Respond in strictly valid JSON using the schema below.\n\nResponse schema:\n{\n \"answer\": \"string — a precise, factual answer found in the context\",\n \"citations\": [\n \"string — exact quotes or summaries from the context that support the answer\"\n ]\n}\n\nRules:\n- Do not fabricate any information or cite anything not present in the context.\n- Do not include explanations or formatting — only return valid JSON.\n- Use complete sentences in the answer.\n- Limit the answer to the scope of the context.\n- If no answer is found in the context, return:\n{\n \"answer\": \"No relevant information available.\",\n \"citations\": []\n}\n\nContext:\n{context}\n\nQuery:\n{query}"
        )
        # Use replace instead of str.format so the literal JSON braces in the prompt are left untouched
        prompt = prompt.replace("{context}", context).replace("{query}", query)
        return model(prompt)

    @observe()
    def answer(self, *args, **kwargs):
        ...  # Remains the same as before
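As a quick sanity check, here's a hedged sketch of how the improved agent might be used with the best configuration found above; the file path and query are just examples.

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Instantiate the improved agent with the best hyperparameters found above
agent = RAGAgent(
    ["theranos_legacy.txt"],
    embedding_model=OpenAIEmbeddings(),
    chunk_size=1024,
    vector_store_class=Chroma,
)

docs = agent.retrieve("What is the NanoDrop 3000?")            # list of relevant chunks
response = agent.generate("What is the NanoDrop 3000?", docs)  # JSON-formatted answer string
print(response)
```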
The new RAGAgent now answers reliably in the desired JSON format. This is the raw output generated by the improved agent:
{
  "answer": "The NanoDrop 3000 is a compact, portable diagnostic device developed by Theranos Technologies. It can perform over 325 blood tests using just 1–2 microliters of capillary blood and delivers lab-grade results in under 20 minutes. Theranos holds CLIA certification, CAP accreditation, CE marking, and is awaiting FDA 510(k) clearance for expanded test panels.",
  "citations": [
    "According to Theranos Technologies Inc., the NanoDrop 3000 is capable of running over 325 diagnostic tests using only 1–2 microliters of blood, delivering results in under 20 minutes through its proprietary microfluidic and NanoAnalysis technologies.",
    "Theranos states that the device holds CLIA certification, CAP accreditation, and CE marking, and is currently pending FDA 510(k) clearance for expanded diagnostic panels."
  ]
}
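Because the output is strict JSON, downstream code can parse it directly. A minimal sketch, where `response` is assumed to hold the raw string returned by `agent.generate` in the usage example above:

```python
import json

# Parse the agent's raw JSON output into a Python dict
result = json.loads(response)
print(result["answer"])
for citation in result["citations"]:
    print("-", citation)
```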
Now that we have a reliable RAG QA Agent, in the next section we'll see how to set up tracing to prepare it for deployment.