Deployment
In this section, we'll set up CI/CD workflows for our RAG QA agent. We'll also add metrics and create spans inside our RAG agent's @observe
decorators to run online evals and get full visibility for debugging its internal components.
Setup Tracing
deepeval
offers an @observe
decorator that lets you apply metrics at any point in your LLM app to evaluate any LLM interaction.
This gives you full visibility for debugging the internal components of your LLM application. Learn more about tracing here.
During the development phase, we added these @observe
decorators to the different components of our RAG agent; now we'll add metrics and create spans. Here's how you can do that:
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from deepeval.metrics import (
ContextualRelevancyMetric,
ContextualRecallMetric,
ContextualPrecisionMetric,
GEval,
)
from deepeval.dataset import EvaluationDataset
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
import tempfile
class RAGAgent:
    def __init__(...):
        ...

    def _load_vector_store(self):
        ...

    @observe(metrics=[ContextualRelevancyMetric(), ContextualRecallMetric(), ContextualPrecisionMetric()], name="Retriever")
    def retrieve(self, query: str):
        docs = self.vector_store.similarity_search(query, k=self.k)
        context = [doc.page_content for doc in docs]
        update_current_span(
            test_case=LLMTestCase(
                input=query,
                actual_output="...",
                expected_output="...",
                retrieval_context=context
            )
        )
        return context

    @observe(metrics=[GEval(...), GEval(...)], name="Generator")  # Use same metrics as before
    def generate(
        self,
        query: str,
        retrieved_docs: list,
        llm_model=None,
        prompt_template: str = None
    ):  # Changed prompt template, model used
        context = "\n".join(retrieved_docs)
        model = llm_model or OpenAI(model_name="gpt-4")
        prompt = prompt_template or (
            "You are an AI assistant designed for factual retrieval. Using the context below, extract only the information needed to answer the user's query. Respond in strictly valid JSON using the schema below.\n\nResponse schema:\n{\n \"answer\": \"string — a precise, factual answer found in the context\",\n \"citations\": [\n \"string — exact quotes or summaries from the context that support the answer\"\n ]\n}\n\nRules:\n- Do not fabricate any information or cite anything not present in the context.\n- Do not include explanations or formatting — only return valid JSON.\n- Use complete sentences in the answer.\n- Limit the answer to the scope of the context.\n- If no answer is found in the context, return:\n{\n \"answer\": \"No relevant information available.\",\n \"citations\": []\n}\n\nContext:\n{context}\n\nQuery:\n{query}"
        )
        # The prompt's JSON schema contains literal braces, so substitute the
        # context and query with str.replace instead of str.format.
        prompt = prompt.replace("{context}", context).replace("{query}", query)
        answer = model(prompt)
        update_current_span(
            test_case=LLMTestCase(
                input=query,
                actual_output=answer,
                retrieval_context=retrieved_docs
            )
        )
        return answer

    @observe(type="agent")
    def answer(
        self,
        query: str,
        llm_model=None,
        prompt_template: str = None
    ):
        retrieved_docs = self.retrieve(query)
        generated_answer = self.generate(query, retrieved_docs, llm_model, prompt_template)
        return generated_answer, retrieved_docs
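With metrics attached to each span, every call to the agent now produces a trace whose Retriever and Generator spans are evaluated against their metrics. Below is a minimal usage sketch; the query is a placeholder and the constructor arguments depend on your own config:
# Minimal usage sketch (placeholder query; initialize the agent with your own
# document paths, embedding model, and other config)
agent = RAGAgent()

answer, retrieved_docs = agent.answer("What is the refund policy?")
print(answer)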
Using Datasets
In the previous section, we saw how to create datasets and store them in the cloud. We can now pull that dataset in our CI/CD pipeline to evaluate our RAG agent.
Here's how to pull the dataset from the cloud:
from deepeval.dataset import EvaluationDataset
dataset = EvaluationDataset()
dataset.pull(alias="QA Agent Dataset")
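If you want to sanity-check what was pulled before wiring it into CI, the dataset's goldens are available as a plain list. This is just an illustrative check, assuming the "QA Agent Dataset" alias exists in your Confident AI project:
# Illustrative check: print the inputs of the first few pulled goldens
for golden in dataset.goldens[:3]:
    print(golden.input)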
Integrating CI/CD
You can use pytest
with assert_test
in your CI/CD pipeline to trace and evaluate your RAG agent. Here's how you can write the test file to do that:
import pytest
from deepeval.dataset import EvaluationDataset
from qa_agent import RAGAgent  # import your RAG agent here
from deepeval import assert_test

dataset = EvaluationDataset()
dataset.pull(alias="QA Agent Dataset")

agent = RAGAgent()  # Initialize with your best config

@pytest.mark.parametrize("golden", dataset.goldens)
def test_rag_qa_agent_components(golden):
    assert_test(golden=golden, observed_callback=agent.answer)
You can then run the test file with the following command:
poetry run deepeval test run test_rag_qa_agent.py
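When running this locally, make sure the same credentials the CI workflow below relies on are available in your shell. The values here are placeholders for your own keys:
export OPENAI_API_KEY="<your-openai-api-key>"         # used by the agent's LLM and embedding calls
export CONFIDENT_API_KEY="<your-confident-api-key>"   # used by deepeval to pull the dataset and report results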
Finally, let's integrate this test into GitHub Actions to enable automated quality checks on every push.
name: RAG QA Agent DeepEval Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout Code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Install Poetry
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
          echo "$HOME/.local/bin" >> $GITHUB_PATH

      - name: Install Dependencies
        run: poetry install --no-root

      - name: Run DeepEval Unit Tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} # Add your OPENAI_API_KEY
          CONFIDENT_API_KEY: ${{ secrets.CONFIDENT_API_KEY }} # Add your CONFIDENT_API_KEY
        run: poetry run deepeval test run test_rag_qa_agent.py
And that's it! You now have a reliable, production-ready RAG QA agent with automated evaluation integrated into your development workflow.
Set up Confident AI to track your RAG QA agent's performance across builds, regressions, and evolving datasets. It's free to get started (no credit card required).
Learn more here.
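If your environment isn't connected to Confident AI yet, one way to authenticate is through the deepeval CLI (alternatively, set the CONFIDENT_API_KEY environment variable, as the CI workflow above does):
# Authenticate deepeval with your Confident AI account
deepeval login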