DeepEval Blog | DeepEval - The Open-Source LLM Evaluation Framework

Build and Evaluate a Multi-Turn Chatbot Using DeepEval

June 24, 2025 · 15 min read

DeepEval Scribe

Chatbots are everywhere — powering services in healthcare, real estate, finance, and more. Thanks to modern tools and frameworks, building one has never been easier. But building a reliable chatbot? That’s the hard part.

It’s not enough for a chatbot to sound good. It needs to handle context, avoid hallucinations, stay safe, and maintain coherent multi-turn conversations. Truly reliable chatbots are only possible through rigorous evaluation and iterative improvement.

In this guide, I’ll show you how to evaluate and improve your multi-turn chatbot using DeepEval, a powerful open-source LLM evaluation framework.

TL;DR

This guide walks you through building, testing, and optimizing a multi-turn medical chatbot. It covers:

Key challenges in multi-turn conversations: memory, tone, hallucinations, and role consistency
Evaluating chatbot quality with metrics like KnowledgeRetentionMetric, RoleAdherenceMetric, and custom ConversationalGEval
Using ConversationSimulator to simulate realistic, multi-turn conversations for evaluation
Improving chatbot performance through prompt refinement and memory strategies
Running unit tests in CI/CD pipelines using DeepEval

The Unique Challenges

Multi-turn chatbots are conversational AI systems designed to remember and understand the context of an ongoing dialogue across multiple back-and-forth exchanges with a user. Unlike single-turn bots that treat each input in isolation (like a basic FAQ or search engine), multi-turn chatbots maintain memory, handle follow-up questions, and adhere to a defined persona or role. The goal is to create a smooth, realistic conversation that feels natural and coherent.

Multi-Turn Chatbot

To build a reliable chatbot, we need to understand why — and how — multi-turn chatbots break. These systems face a unique set of challenges that go far beyond generating good-sounding responses. They must:

Accurately track context across multiple exchanges
Avoid hallucinating or fabricating information
Handle ambiguity with care
Balance informativeness with tone and empathy
Know when to say I don’t know

Let’s look at how these issues show up in a real-world use case by building a medical assistant chatbot.

Building the Chatbot

Building a reliable multi-turn chatbot requires more than just generating responses. In our case, we’re creating a medical assistant that interacts directly with patients and helps address their health concerns. To do this safely, we’ll start with clear responsibilities and well-defined evaluation goals.

Our chatbot will follow three key principles:

Define a clear role: an empathetic and helpful medical assistant
Track chat history across multiple turns to remember symptoms
Generate medically accurate advice based only on prior inputs

We’ll begin with a minimal version to demonstrate core functionality. While it isn’t production-ready, it provides a solid foundation we can iterate on and evaluate using DeepEval.

Click to see the implementation of a simple multi-turn chatbot

from openai import OpenAI

client = OpenAI()


class SimpleChatbot:
    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt
        self.history = [{"role": "system", "content": self.system_prompt}]

    def chat(self, user_input: str) -> str:
        self.history.append({"role": "user", "content": user_input})

        response = client.chat.completions.create(
            model="gpt-4",
            messages=self.history,
        )

        reply = response.choices[0].message.content.strip()
        self.history.append({"role": "assistant", "content": reply})
        return reply

    async def a_chat(self, user_input: str) -> str:
        self.history.append({"role": "user", "content": user_input})

        response = await client.chat.completions.acreate(
            model="gpt-4",
            messages=self.history,
        )

        reply = response.choices[0].message.content.strip()
        self.history.append({"role": "assistant", "content": reply})
        return reply

note

In production, you'd likely manage this with a more structured chatbot class or memory system. But for evaluation purposes, this minimal setup is all we need.

Here’s how you can try out the SimpleChatbot in practice:

chatbot = SimpleChatbot(
    system_prompt="You are a helpful and empathetic medical assistant. Answer questions clearly using known medical knowledge only."
)

print(chatbot.chat("Hi, I've had a cough and fever."))
print(chatbot.chat("Now I have a headache too. Should I be worried?"))

This example demonstrates how the chatbot maintains context across multiple turns and provides responses based on prior information. While it appears to generate accurate and relevant outputs, surface-level observation isn’t enough to determine its reliability — especially in sensitive domains like healthcare.

Evaluating a multi-turn chatbot remains a complex task. That’s where DeepEval helps. It enables structured evaluation of LLM-based applications using real-world metrics that reflect true conversational quality — including memory handling, role consistency, and tone.

Here are the key metrics DeepEval offers for evaluating any multi-turn chatbot:

Turn Relevancy — Checks whether the chatbot's responses remain relevant to the user's input.
Role Adherence — Measures how consistently the chatbot stays aligned with its assigned persona or role.
Knowledge Retention — Assesses whether the chatbot remembers critical context from earlier turns in the conversation.
Conversation Completeness — Evaluates if the responses are thorough and adequately address user inputs.
Custom metrics — Allows for tailored evaluation criteria based on domain-specific needs, such as empathy, safety, or tone.

Evaluating Your Chatbot with DeepEval

Our chatbot is built on 3 key principles which we've defined in the previous section, using those 3 principles we'll be defining our evaluation metrics:

Role Adherence: Measures how consistently the chatbot stays in character as a professional, empathetic medical assistant.
Knowledge Retention: Assesses whether the chatbot remembers earlier parts of the conversation, such as symptoms.
Medical Assistant Quality: A custom metric that evaluates the overall conversational quality.

Identifying the right metrics is only part of the challenge — the real bottleneck is having quality data to evaluate against. Evaluating multi-turn chatbots requires realistic conversations that simulate how users actually interact, including follow-ups, ambiguity, and varied tone. Creating these test cases manually is slow, repetitive, and often where teams hit a wall.

DeepEval solves this with its built-in Conversation Simulator, which automatically generates high-quality simulations based on your chatbot’s role. This removes a major barrier to rigorous testing and makes it easy to evaluate your chatbot continuously as it evolves.

Simulating conversations

Here’s how you can use the ConversationSimulator to generate synthetic ConversationalTestCases.

import asyncio
from deepeval.simulator import ConversationSimulator
from deepeval.test_case import ConversationalTestCase
from typing import List, Dict
from chatbot import SimpleChatbot  # Assuming your chatbot class is in chatbot.py

# Define user intentions for our medical chatbot
user_intentions = {
    "reporting new symptoms and seeking advice": 3,
    "asking about medication side effects": 2,
    "inquiring about illness prevention": 1,
}

# Optional user profile attributes to add variation
user_profile_items = [
    "patient's age",
    "known allergies",
    "current medications",
]

# Initialize chatbot with system prompt
chatbot = SimpleChatbot(
    system_prompt="You are a helpful and empathetic medical assistant. Answer clearly using only medically accurate information."
)

# Define simulator
simulator = ConversationSimulator(
    user_intentions=user_intentions, user_profile_items=user_profile_items
)

# Define model callback for simulator
async def chatbot_callback(
    user_input: str, conversation_history: List[Dict[str, str]]
) -> str:
    chatbot.history = [{"role": "system", "content": chatbot.system_prompt}]
    for turn in conversation_history:
        chatbot.history.append({"role": "user", "content": turn["user_input"]})
        chatbot.history.append({"role": "assistant", "content": turn["agent_response"]})

    reply = await chatbot.a_chat(user_input)
    return reply


# Run the simulation
async def run_simulation():
    print("Starting conversation simulation...")
    convo_test_cases: List[ConversationalTestCase] = await simulator.simulate(
        model_callback=chatbot_callback,
        stopping_criteria="Stop when the user's medical concern is addressed with actionable advice.",
        min_turns=3,
        max_turns=6,
    )
    print(f"\nGenerated {len(convo_test_cases)} conversational test cases.")


if __name__ == "__main__":
    asyncio.run(run_simulation())

And just like that, you've got realistic, multi-turn test cases — without spending hours writing them yourself.

Evaluating the chatbot

With the new simulated test cases in place, we can now evaluate how the chatbot performs. Using the metrics defined earlier — role adherence, knowledge retention, and overall response quality — we’ll assess its behavior across the realistic multi-turn conversations generated by the simulator. Here's how to run the evaluation using DeepEval:

from deepeval.metrics import (
    RoleAdherenceMetric,
    KnowledgeRetentionMetric,
    ConversationalGEval,
)
from deepeval import evaluate

# Assign role to each test case for Role Adherence evaluation
for test_case in convo_test_cases:
    test_case.chatbot_role = "a professional, empathetic medical assistant"

# Define evaluation metrics
metrics = [
    KnowledgeRetentionMetric(),
    RoleAdherenceMetric(),
    ConversationalGEval(
        name="MedicalAssistantQuality",
        criteria="Evaluate the assistant's response in a medical context, considering medical accuracy, completeness, empathy, and avoidance of risky or overly confident advice.",
    ),
]

# Run evaluation
evaluate(test_cases=convo_test_cases, metrics=metrics)

With the evaluation complete, it's clear our chatbot has room for improvement. These were the results when I evaluated the chatbot:

Metric	Score
Knowledge Retention	0.7
Role Adherence	0.6
Medical Assistant Quality	0.5

While knowledge retention seems to be performing well, the chatbot struggles with maintaining its assigned role and delivering high-quality responses in a medical context. These gaps reduce its reliability, especially in multi-turn interactions where trust and clarity are essential.

Two main factors contribute to this outcome: a generic system prompt and the way conversation history is handled. Currently, the chatbot uses the full history across turns without filtering or summarization. Although this retains context, it increases the risk of overwhelming the model’s context window and leads to inconsistent behavior as conversations grow longer. LLMs often struggle with long, unstructured inputs — especially when tasked with remembering key details over multiple exchanges.

In the next section, we'll explore how refining the prompt and introducing a more structured memory strategy can help improve performance across all three metrics.

Improving Your Chatbot with DeepEval

Improving a chatbot’s performance often comes down to adjusting a few key hyperparameters — the fundamental settings that influence how it behaves in real-world conversations.

For multi-turn chatbots, these are the parameters that typically have the biggest impact:

LLM choice
Prompt design
Chat history management

Click here to see the changes that were made to SimpleChatbot class to support hyperparameters.

from openai import OpenAI
from typing import Literal

client = OpenAI()

class SimpleChatbot:
    def __init__(
        self,
        system_prompt: str,
        llm: str = "gpt-4",
        history_mode: Literal["full", "windowed", "summary"] = "full",
        history_window: int = 6,
        summarizer_model: str = "gpt-3.5-turbo"
    ):
        self.system_prompt = system_prompt
        self.llm = llm
        self.history_mode = history_mode
        self.history_window = history_window
        self.summarizer_model = summarizer_model
        self.history = []
        self.summary = ""

    def chat(self, user_input: str) -> str:
        # Build messages based on history strategy
        if self.history_mode == "summary":
            messages = [
                {"role": "system", "content": f"{self.system_prompt}\n\nSummary:\n{self.summary}"},
                {"role": "user", "content": user_input}
            ]
        else:
            messages = [{"role": "system", "content": self.system_prompt}]
            if self.history_mode == "windowed":
                messages += self.history[-self.history_window:]
            else:  # full
                messages += self.history
            messages.append({"role": "user", "content": user_input})

        # Get assistant reply
        response = client.chat.completions.create(
            model=self.llm,
            messages=messages,
            temperature=0,
        )
        reply = response.choices[0].message.content.strip()

        # Update full history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": reply})

        # If summary mode, regenerate summary from history
        if self.history_mode == "summary":
            summary_prompt = "Summarize the following conversation between a patient and a medical assistant. Keep it concise and medically relevant:\n\n"
            full_transcript = ""
            for msg in self.history:
                if msg["role"] == "user":
                    full_transcript += f"User: {msg['content']}\n"
                elif msg["role"] == "assistant":
                    full_transcript += f"Assistant: {msg['content']}\n"

            summary_response = client.chat.completions.create(
                model=self.summarizer_model,
                messages=[
                    {"role": "system", "content": summary_prompt},
                    {"role": "user", "content": full_transcript}
                ],
                temperature=0,
            )
            self.summary = summary_response.choices[0].message.content.strip()

        return reply

    async def a_chat(self, user_input: str) -> str:
        # Use `acreate` method and implement the asynchronous chat method here

Now that our chatbot supports these hyperparameters, we can begin experimenting with different combinations to see which configuration performs best across evaluation metrics.

from deepeval.metrics import (
    RoleAdherenceMetric,
    KnowledgeRetentionMetric,
    ConversationalGEval,
)
from deepeval import evaluate
from chatbot import SimpleChatbot

# --- Evaluation Metrics ---
metrics = [...]  # Use the same metrics we've previously defined

# Prompt variations
prompt_templates = [
    "You are a helpful and empathetic medical assistant. Answer clearly using only medically accurate information.",
    "You are a medical assistant. Avoid giving prescriptions or diagnoses. Recommend seeing a doctor when unsure.",
    "You are a friendly but cautious medical assistant. Always answer with verified medical facts. If the input is unclear or serious, gently encourage the user to consult a healthcare provider. Avoid assumptions or overconfidence.",
    "You are a professional medical assistant. Do not diagnose, speculate, or provide treatment plans. Stick strictly to factual medical information. For all specific concerns, direct the patient to a licensed physician.",
]

# OpenAI model options
models = ["gpt-3.5-turbo", "gpt-4"]

# History modes to test
history_modes = ["full", "windowed", "summary"]

# Create a simulate_conversations function that takes the chatbot as an argument and returns convo_test_cases that were simulated.
def simulate_conversations(chatbot):
    ...


# Run evaluations across all combinations
for model_name in models:
    for prompt in prompt_templates:
        for mode in history_modes:
            print(f"\nEvaluating: Model = {model_name}, History = {mode}")

            # Create chatbot with given config
            chatbot = SimpleChatbot(
                system_prompt=prompt,
                llm=model_name,
                history_mode=mode,
            )

            # Call the simulate_conversations function with the new chatbot
            convo_test_cases = simulate_conversations(chatbot)

            # Assign chatbot role for evaluation
            for test_case in convo_test_cases:
                test_case.chatbot_role = "a professional, empathetic medical assistant"

            # Evaluate and print metrics
            evaluate(test_cases=convo_test_cases, metrics=metrics)

After running all combinations, one configuration clearly stood out:

Prompt Template: Prompt 3 — strict, factual, safety-first
Model: GPT-4
History Strategy: Summary mode

This setup consistently delivered high scores across all evaluation metrics:

Metric	Score
Knowledge Retention	0.9
Role Adherence	0.9
Medical Assistant Quality	0.8

Here’s a quick before-and-after comparison:

Metric	Initial Version	Optimized Version
Knowledge Retention	0.7	0.9
Role Adherence	0.6	0.9
Medical Assistant Quality	0.5	0.8

The improvements are substantial — especially in knowledge tracking and maintaining a consistent, reliable assistant persona. With a stronger prompt and a structured memory strategy, the chatbot becomes much more suitable for production use in sensitive domains like healthcare.

Takeaways

Switching to Prompt Template 3, GPT-4, and summary history mode led to significant improvements across all key metrics.

Both KnowledgeRetentionMetric and RoleAdherenceMetric reached scores of 0.9, while MedicalAssistantQuality improved from 0.5 to 0.8 — a clear sign of better consistency, safety, and relevance.

These results weren’t accidental. With focused prompt design and memory strategy, and by evaluating the right metrics, meaningful progress becomes measurable — and repeatable.

Multi-turn chatbot test flow using DeepEval’s ConversationSimulator

This is how we can use DeepEval to create reliable multi-turn chatbots.

Unit Testing in CI/CD for Continuous Evaluation

Maintaining chatbot reliability over time requires more than strong initial performance. As you update prompts, switch models, or adjust memory strategies, even small changes can introduce subtle regressions.

To ensure consistent behavior, unit testing is essential. By writing automated tests for your chatbot’s core conversational flows, you can detect issues early and prevent quality from degrading as your system evolves.

DeepEval simplifies this process. With just a few lines of code, you can write unit tests for realistic conversations, run them in your CI/CD pipeline, and receive clear feedback when something breaks.

Here’s how to integrate DeepEval into your pipeline to validate your chatbot with every commit:

test_chatbot_quality.py
import pytest
import asyncio
from typing import List, Dict
from deepeval.test_case import ConversationalTestCase
from deepeval.simulator import ConversationSimulator
from deepeval.metrics import (
    KnowledgeRetentionMetric,
    RoleAdherenceMetric,
    ConversationalGEval,
)
from deepeval import assert_test
from simple_chatbot import SimpleChatbot  # Make sure this matches your file name

# Define user intentions for our medical chatbot (used by ConversationSimulator)
user_intentions = {
    "reporting new symptoms and seeking advice": 3,
    "asking about medication side effects": 2,
    "inquiring about illness prevention": 1,
}

# Optional user profile attributes to add variation (used by ConversationSimulator)
user_profile_items = [
    "patient's age",
    "known allergies",
    "current medications",
]

# Initialize chatbot with a default configuration for simulation setup
# This chatbot instance will be passed to simulate_conversations
chatbot_for_simulation_setup = SimpleChatbot(
    system_prompt="You are a friendly but cautious medical assistant. Always answer with verified medical facts. If the input is unclear or serious, gently encourage the user to consult a healthcare provider. Avoid assumptions or overconfidence.",
    llm="gpt-4",
    history_mode="summary",
)

# Define evaluation metrics
metrics = [
    KnowledgeRetentionMetric(threshold=0.8),
    RoleAdherenceMetric(threshold=0.8),
    ConversationalGEval(
        name="MedicalAssistantQuality",
        criteria=(
            "Evaluate whether the assistant's response is medically accurate, complete, empathetic, "
            "and avoids risky, speculative, or overconfident advice."
        ),
        threshold=0.8,
    ),
]

# The simulate_conversations function, now a placeholder as requested.
def simulate_conversations(chatbot):
    ...


# Generate test cases by simulating conversations with the chatbot
# This line now correctly calls the synchronous wrapper function.
test_cases = simulate_conversations(chatbot_for_simulation_setup)

# Assign role to each test case for Role Adherence evaluation
# This is done once after all test cases are generated
for test_case in test_cases:
    test_case.chatbot_role = "a professional, empathetic medical assistant"

# Parametrized CI/CD test function
@pytest.mark.parametrize("test_case", test_cases)
def test_chatbot_performance(test_case: ConversationalTestCase):
    assert_test(test_case, metrics)

This test file plugs straight into any CI setup (GitHub Actions, GitLab CI, etc.), so your chatbot keeps meeting quality and safety standards with every push. Just run:

bash
poetry run deepeval test run test_chatbot_quality.py

Now let’s write our GitHub actions file to complete our CI integration.

.github/workflows/deepeval-tests.yml
name: Medical Chatbot DeepEval Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Install Poetry
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
          echo "$HOME/.local/bin" >> $GITHUB_PATH

      - name: Install Dependencies
        run: poetry install --no-root

      - name: Run DeepEval Unit Tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: poetry run deepeval test run test_chatbot_quality.py

Conclusion

We’ve seen how even a simple chatbot can miss the mark — and how DeepEval helps you go deeper than surface-level performance to test what actually matters: memory, tone, safety, empathy, and relevance.

By simulating real conversations, defining the right metrics, and plugging evaluation into CI, you catch issues early — before they ever reach a real user. No guesswork. No assumptions. Just measurable, repeatable quality.

Whether you're fixing hallucinations or fine-tuning prompts, the mindset is the same: treat your chatbot like any other critical system — test it, iterate on it, and never ship blind.

Already have a bot in production? Start evaluating it. You might be surprised by what you find.

Evaluate a RAG-Based Contract Assistant with DeepEval

June 12, 2025 · 13 min read

Cale

DeepEval Scribe

Jeffrey Ip

DeepEval Wizard

Imagine this — You’re building a contract assistant for a mid-sized law firm with over 300 employees and a repository of more than 10,000 archived contracts and internal policies.

You need to build a Retrieval-Augmented Generation (RAG) system designed to help lawyers, paralegals, and HR personnel quickly find precise answers to complex queries about contracts, policies, and compliance.

In this scenario, the reliability of the RAG system is absolutely critical. There is no room for error. Think of a scenario where the assistant hallucinates contract clauses that don’t exist, cites outdated or superseded policies, or misses key compliance requirements. These failures could lead to costly legal risks, compliance violations, or internal confusion that could jeopardize client trust and company reputation.

This tutorial walks you through how to build a reliable RAG system with DeepEval, focusing on:

Automatically generating high-quality test data from your own docs
Component-level evaluation for both retrievers and generators
Integrating CI/CD tests that adapt as your contracts evolve

By the end of this tutorial, you’ll have a deployable RAG app that’s not only smart — it’s battle-tested.

Evaluating Your Retriever with DeepEval

A hallucination doesn’t start in generation — it starts in retrieval. If your retriever surfaces irrelevant or incomplete context, your LLM is doomed before it even starts generating. In high-stakes use cases like contracts or compliance, one bad passage can trigger a cascade of wrong answers — or worse, legal risk.

Building a basic retriever

Let’s say you’re using a standard FAISS + OpenAIEmbeddings retriever.

Click here to see the implementation of a simple retriever

from langchain.vectorstores import Chroma, FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

class SimpleRetriever:
    def __init__(
        self,
        document_path: str,
        embedding_model=None,
        chunk_size: int = 500,
        chunk_overlap: int = 50,
        vector_store_class=FAISS,
        persist_directory: str = None,
        k: int = 2
    ):
        self.document_path = document_path
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.embedding_model = embedding_model or OpenAIEmbeddings()
        self.vector_store_class = vector_store_class
        self.persist_directory = persist_directory
        self.k = k
        self.vector_store = self._load_vector_store()


    def _load_vector_store(self):
        with open(self.document_path, "r", encoding="utf-8") as file:
            raw_text = file.read()

        splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap
        )
        documents = splitter.create_documents([raw_text])

        if self.vector_store_class == Chroma:
            return self.vector_store_class.from_documents(
                documents, self.embedding_model,
                persist_directory=self.persist_directory
            )
        else:
            return self.vector_store_class.from_documents(documents, self.embedding_model)


    def retrieve(self, query: str):
        return self.vector_store.similarity_search(query, k=self.k)


# Initialize retriever
retriever = SimpleRetriever("document.txt")

# Query the retriever
query = "What benefits do part-time employees get?"
results = retriever.retrieve(query)

This retriever works — but how well?

Here’s what we need to consider when evaluating retrievers:

Contextual Relevancy – Is this the info I’d want if I were answering this question?
Contextual Recall – Did I retrieve enough of the good stuff?
Contextual Precision – Did I avoid junk I don’t need?

But knowing what to evaluate isn’t enough, here comes the hardest part of evaluating retrievers. Retrievers cannot be evaluated without a ground-truth to evaluate them against. This means we need question and answer "pairs" that we can use to evaluate our retriever against from our original documents. But this is a tedious, expensive, and time-consuming step.

DeepEval helps you get around this with its built-in synthesizer, which can generate high-quality question–answer pairs from your raw documents — automating a huge part of the process and setting you up for continuous testing down the line.

Generating Goldens

Here’s how easy it is to generate those goldens:

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["document.txt"], chunk_size=500, chunk_overlap=50
)

Now we can use these generated goldens to evaluate our retriever. Here’s how we can evaluate our retriever using the 3 metrics mentioned before:

from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ContextualRelevancyMetric,
    ContextualRecallMetric,
    ContextualPrecisionMetric,
)

# Initialize metrics
relevancy = ContextualRelevancyMetric()
recall = ContextualRecallMetric()
precision = ContextualPrecisionMetric()

# Evaluate for each golden
for golden in goldens:
    retrieved_docs = retriever.retrieve(golden.input)
    context_list = [doc.page_content for doc in retrieved_docs]
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=golden.expected_output,
        expected_output=golden.expected_output,
        retrieval_context=context_list
    )
    relevancy.measure(test_case)
    recall.measure(test_case)
    precision.measure(test_case)

    print(f"Q: {golden.input}\nA: {golden.expected_output}")
    print(f"Relevancy: {relevancy.score}, Recall: {recall.score}, Precision: {precision.score}")

When I did the evaluation using the above retriever, I got an average of 0.52, 0.75 and 0.64 for Relevancy, Recall and Precision scores. These are passable to say the least. And hence there is a need to find the best hyperparameters i.e., chunking strategies, different embedding models, different retriever types.

Improving your retriever

Now let’s iterate over different strategies to see which model works best for us.

from deepeval.test_case import LLMTestCase
from langchain.embeddings import OpenAIEmbeddings, HuggingFaceEmbeddings
from langchain.vectorstores import Chroma, FAISS
from deepeval.synthesizer import Synthesizer
from deepeval.metrics import (
    ContextualRelevancyMetric,
    ContextualRecallMetric,
    ContextualPrecisionMetric,
)
import tempfile

# Example configurations
chunking_strategies = [500, 1024, 2048]
embedding_models = [
    ("OpenAIEmbeddings", OpenAIEmbeddings()),
    ("HuggingFaceEmbeddings", HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")),
]
retriever_models = [
    ("FAISS", FAISS),
    ("Chroma", Chroma)
]

# Initialize metrics
relevancy = ContextualRelevancyMetric()
recall = ContextualRecallMetric()
precision = ContextualPrecisionMetric()

# Generate goldens only once unless testing synthesis configs
synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_docs(document_paths=["document.txt"])

# Iterate over retriever configs
for chunk_size in chunking_strategies:
    for embedding_name, embedding_model in embedding_models:
        for retriever_name, retriever_type in retriever_models:
            print(f"Evaluating with Chunk Size: {chunk_size}, Embedding: {embedding_name}, Retriever: {retriever_name}")

            persist_dir = tempfile.mkdtemp() if retriever_type == Chroma else None

            retriever = SimpleRetriever(
                document_path="document.txt",
                chunk_size=chunk_size,
                chunk_overlap=50,
                embedding_model=embedding_model,
                vector_store_class=retriever_type,
                persist_directory=persist_dir,  # Pass only if using Chroma
            )

            for golden in goldens:
                retrieved_docs = retriever.retrieve(golden.input)
                context_list = [doc.page_content for doc in retrieved_docs]

                test_case = LLMTestCase(
                    input=golden.input,
                    actual_output=golden.expected_output,
                    expected_output=golden.expected_output,
                    retrieval_context=context_list
                )

                relevancy.measure(test_case)
                recall.measure(test_case)
                precision.measure(test_case)

                print(f"Q: {golden.input[:70]}...")
                print(f"Relevancy: {relevancy.score}, Recall: {recall.score}, Precision: {precision.score}")

After these iterations I’ve found that using HuggingFaceEmbeddings and FAISS with 1024 chunks gives me an average score of 0.82, 0.92 and 0.89 for Relevancy, Recall and Precision.

Here's a table to compare the results

Metric	Initial Retriever	Optimized Retriever
Relevancy	0.52	0.82
Recall	0.75	0.92
Precision	0.64	0.89

Takeaways

Swapping to HuggingFaceEmbeddings and increasing chunk size to 1024 improved all key scores — pushing Relevancy to 0.82, Recall to 0.92 and Precision to 0.89. With DeepEval, tuning isn't guesswork — it's measured progress. Of course this is only in my case and you might have better results with different hyperparameters. Feel free to test them out to find the best ones that work for your data.

This is the flow you want to follow if you are trying to create a reliable retriever with DeepEval.

DeepEval Retriever Evaluation Flow Diagram

Evaluating Your Generator with DeepEval

Most teams think retrieval is the bottleneck. It’s not. In real-world RAG systems, generation is where trust collapses. You can have a flawless retriever — and still return confidently wrong answers.

Why? Because the generator is the system’s voice. It’s what users read, cite, forward to legal, or base decisions on. If that voice misstates facts or hallucinates clauses, it doesn't matter how good your context was — your product is broken.

Building a basic generator

In most setups, you’re building a prompt using retrieval context and user query, below is an example of how generators are usually made:

Click here to see the implementation of a simple generator

from langchain.llms import OpenAI
from typing import List


class Generator:
    def __init__(self, retriever, llm=None, prompt_template=None):
        self.retriever = retriever
        self.llm = llm or OpenAI(temperature=0)
        self.prompt_template = (
            prompt_template
            or "Answer the question using the context below.\n\nContext:\n{context}\n\nQuestion:\n{question}"
        )

    def generate(self, question: str) -> str:
        retrieved_docs = self.retriever.retrieve(question)
        context = "\n".join([doc.page_content for doc in retrieved_docs])
        prompt = self.prompt_template.format(context=context, question=question)
        return self.llm(prompt)

This might feel like a solid generator — but is it?

Let’s first try to use our generator:

retriever = SimpleRetriever(
    document_path="document.txt",
    chunk_size=1024,
    chunk_overlap=50,
    embedding_model=HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    ),
    vector_store_class=FAISS,
)

generator = Generator(retriever=retriever)

question = "What benefits do part-time employees get?"
answer = generator.generate(question)
print(answer)

It looks good and it sounds right. But LLMs are expert improvisers. Without proper grounding, they invent policies, procedures, and legalese.

In my testing, the model confidently stated policies that didn’t exist in the context. That’s not a hallucination — it’s a compliance failure.

Just like we did with retrievers, we need to evaluate generators with real metrics, not just vibes. DeepEval makes this concrete with out-of-the-box and custom metrics:

Faithfulness – Does it stick to the retrieved context?
Answer Relevancy – Is the answer focused on the query?
Tone – Is the response professionally framed?
Citations – Are document sources properly referenced?

Before testing across your whole dataset, start with a single golden pair. Iterate on prompts, formatting, or context structure. Once it’s reliable — then scale.

Here’s how you can evaluate the generator with the above mentioned metrics:

from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Hardcoded query and expected answer
query = "What benefits do part-time employees get?"
expected_answer = "Part-time employees receive prorated healthcare coverage, flexible PTO, and are eligible for wellness reimbursements."

# Run RAG pipeline
retrieved_docs = retriever.retrieve(query)
context = [doc.page_content for doc in retrieved_docs]
generated_answer = generator.generate(query)

# Create test case
test_case = LLMTestCase(
    input=query,
    actual_output=generated_answer,
    expected_output=expected_answer,
    retrieval_context=context,
)

# Initialize metrics
metrics = [
    FaithfulnessMetric(),
    AnswerRelevancyMetric(),
    GEval(
        name="Tone",
        criteria="Is the answer professional?",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
        strict_mode=True,
    ),
    GEval(
        name="Citations",
        criteria="Does the answer cite or refer to the source documents?",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.CONTEXT],
        strict_mode=True,
    ),
]


# Evaluate
for metric in metrics:
    metric.measure(test_case)
    print(f"{metric.name}: {metric.score} | {metric.reason}")

You now have a structured and repeatable way to measure how well your generator is performing — and which dimensions (e.g. tone, grounding, citations) need improvement.

Improving your generator

There are multiple levers you can adjust to improve the generator:

LLM choice
Prompt phrasing
Context window length
Citation formatting and instruction

from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric, GEval
from langchain.llms import Ollama, OpenAI, HuggingFaceHub

query = "What benefits do part-time employees get?"
expected_answer = "Part-time employees receive prorated healthcare coverage, flexible PTO, and are eligible for wellness reimbursements."
prompts = [
    "You are an HR assistant. Use only the provided documents.\n\n{context}\n\nQuestion: {query}\nAnswer:",
    "Use ONLY the following internal policies to answer.\n\n{context}\n\nQ: {query}\nAnswer (cite sources):",
    "Provide a complete, legally grounded answer sourced from the documentation below.\n\n{context}\n\nClient Q: {query}\nA:",
]

# Models
models = [
    ("ollama", Ollama(model="llama3")),
    ("openai", OpenAI(model_name="gpt-4")),
    ("huggingface", HuggingFaceHub(repo_id="google/flan-t5-large")),
]

metrics = [
    FaithfulnessMetric(),
    AnswerRelevancyMetric(),
    GEval(
        name="Tone",
        criteria="Is the answer professional?",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
        strict_mode=True,
    ),
    GEval(
        name="Citations",
        criteria="Does the answer cite or refer to the source documents?",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.CONTEXT],
        strict_mode=True,
    ),
]

retrieved_docs = retriever.retrieve(query)
context = [doc.page_content for doc in retrieved_docs]

for i, prompt_template in enumerate(prompts, 1):
    for model_name, model in models:
        print(f"Prompt Variant {i} | Model: {model_name}")

        generator = Generator(
            retriever=retriever,
            llm=model,
            prompt_template=prompt_template
        )
        generated_answer = generator.generate(query)

        test_case = LLMTestCase(
            input=query,
            actual_output=generated_answer,
            expected_output=expected_answer,
            retrieval_context=context,
        )

        for metric in metrics:
            metric.measure(test_case)
            print(f"{metric.name}: {metric.score} | {metric.reason}")

After testing all prompt–model combinations, I found:

Prompt 2 (explicit grounding + citation instructions)
Model: OpenAI’s GPT-4

consistently scored highest on all four metrics as follows Faithfulness: 0.91 | Relevancy: 0.88.

This is the flow you want to follow if you are trying to create a reliable generator. DeepEval Generator Evaluation Flow Diagram

tip

Don’t eval in isolation. Retrieval + generation must be co-optimized — or you’ll chase ghosts.

To help visualize this robust RAG architecture, here's a diagram illustrating the flow:

DeepEval Retriever and Generator Evaluation Flow Diagram

CI/CD Integration for Continuous Evaluation

Building a reliable RAG app is a significant achievement, but for a truly production-grade system, you need continuous validation of your application's performance. This means integrating your evaluation tests directly into your CI/CD pipeline (using tools like GitHub Actions, GitLab CI, or Jenkins).

Why generate golden data in CI?

Your law firm's contracts and internal policies are living documents. They'll inevitably be updated, revised, or new ones added. If your evaluation dataset is static, your tests can quickly become outdated, leading to silent failures or false positives.

By dynamically regenerating your golden question-answer "pairs" during your CI run, your tests automatically adapt to content changes. This prevents regressions caused by outdated test data and ensures your RAG application remains trustworthy and accurate over time.

Integrating DeepEval tests into your CI/CD

Let's assume your core RAG application logic (retriever and generator) is defined or imported, perhaps in rag_app.py, and your tests are in tests/test_rag_app.py.

Here’s an example test function you can plug into your CI pipeline to ensure continuous performance monitoring:

from deepeval.synthesizer import Synthesizer
from deepeval.metrics import (
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    GEval,
    ContextualRelevancyMetric,
)
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.dataset import EvaluationDataset
from deepeval import assert_test

# Assume SimpleRetriever and Generator classes are imported or defined here
# If these classes are in a separate file (e.g., rag_app.py), you would import them like this:
# from rag_app import SimpleRetriever, Generator
# In real test file, you would instantiate these with the best performing config
retriever_instance = SimpleRetriever(...)
# Retriever with:
#   HuggingFaceEmbeddings
#   FAISS
#   1024 chunks

generator_instance = Generator(...)
# Generator with:
#   GPT-4
#   Prompt 2


# Generate Q&A pairs (goldens) dynamically from your current documents
synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["document.txt"], chunk_size=1024, chunk_overlap=50
)

dataset = EvaluationDataset(goldens=goldens)

# Create DeepEval test cases from your golden pairs
for golden in goldens:
    query = golden.input
    expected_answer = golden.expected_output

    # Retrieve relevant docs
    retrieved_docs = retriever_instance.retrieve(query)
    context_list = [doc.page_content for doc in retrieved_docs]

    # Generate answer
    generated_answer = generator_instance.generate(query)

    dataset.add_test_case(
        LLMTestCase(
            input=query,
            actual_output=generated_answer,
            expected_output=expected_answer,
            retrieval_context=context_list,
        )
    )

# Define metrics with thresholds
metrics = [
    FaithfulnessMetric(threshold=0.7),
    AnswerRelevancyMetric(threshold=0.7),
    ContextualRelevancyMetric(threshold=0.7),
    GEval(
        name="Professional Tone Check",
        criteria="Is the answer professionally framed and appropriate for a legal context?",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
        strict_mode=True,
        threshold=0.8,
    ),
]

# 5. Use pytest.mark.parametrize to iterate over the dataset and run tests
@pytest.mark.parametrize("test_case", dataset.test_case)
def test_rag_application_performance(test_case: LLMTestCase):
    # Use assert_test to run all specified metrics on the test_case
    # If any metric fails its threshold, assert_test will raise an AssertionError
    assert_test(test_case, metrics)

This test ensures your retriever and generator keep performing at a high standard every time your documents or code changes.

Now let’s write our GitHub actions file to complete our CI integration.

name: RAG  DeepEval  Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Install Poetry
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
          echo "$HOME/.local/bin" >> $GITHUB_PATH

      - name: Install Dependencies
        run: poetry install --no-root

      - name: Run DeepEval Unit Tests
        run: poetry run deepeval test run test_rag_app.py

Conclusion

Building a RAG application isn’t just about connecting retrieval to generation — it’s about making sure every step is measurable, reliable, and resilient.

With DeepEval, you're not just running tests — you're embedding evaluation into the DNA of your system. From automatic test case generation to metric-driven tuning and seamless CI/CD integration, you've now seen how to take a RAG pipeline from experimental to production-ready.

As your documents evolve and your models improve, DeepEval ensures your LLM workflows stay grounded, consistent, and trustworthy — no guesswork, just confident AI.

How Cognee Used DeepEval to Validate Their AI Memory Research: A Case Study

June 3, 2025 · 9 min read

Jeffrey Ip

DeepEval Wizard

We're excited to showcase how Cognee utilized DeepEval's comprehensive evaluation framework to rigorously test and validate their groundbreaking academic research on AI memory systems. Their work demonstrates the power of standardized evaluation methodologies in advancing AI memory performance research and represents an excellent example of how DeepEval enables rigorous academic validation.

The Challenge That Cognee Faced

As AI memory systems become increasingly sophisticated, traditional evaluation approaches often fall short when assessing complex memory retrieval and reasoning capabilities. Cognee recognized that the challenge lies in accurately measuring multiple dimensions simultaneously: the correctness of retrieved and generated information, the relevance of contextual information to user queries, the coverage and completeness of retrieved context, and the consistency of results across multiple evaluation runs.

Cognee addressed this gap by implementing a comprehensive evaluation strategy using DeepEval's advanced metrics. Rather than relying on simple accuracy measures, they needed an evaluation framework that could capture the nuanced performance characteristics of modern AI memory systems. In addition, they extended their evaluation approach by using F1 and EM scores and varied the evaluations across multiple datasets.

Cognee's Comprehensive Approach Using DeepEval

Cognee implemented a multi-faceted evaluation strategy using F1, EM scores and DeepEval's correctness metric - three key evaluation approaches to thoroughly assess their AI memory system's performance across different dimensions.

How Cognee Used DeepEval's Correctness Metric

Cognee's primary evaluation focused on measuring the accuracy of question-answering capabilities using DeepEval's correctness metric. Their methodology involved preparing comprehensive QA pairs with golden answers, then serving questions to the Cognee system for context retrieval. They generated final answers using LLMs with the retrieved context and evaluated these LLM-generated answers against golden standards using DeepEval's correctness scoring.

This approach revealed several important insights about both their system and our evaluation framework. While DeepEval's correctness scores provided valuable insights into system performance, Cognee observed notable variability across multiple evaluation runs. This instability highlighted the importance of running multiple iterations to get reliable performance estimates. Additionally, they discovered that DeepEval occasionally over-penalized answers that were technically correct but expressed differently than the golden standard, providing us with valuable feedback on the need for more nuanced semantic similarity measures. Perhaps most surprisingly, they encountered technical challenges where JSON output generation sometimes failed, even when using robust, high-performance models, emphasizing the importance of robust output parsing mechanisms.

Leveraging DeepEval's Contextual Relevancy Metric

Beyond correctness, Cognee needed to understand how well their system retrieved relevant information for given questions. DeepEval's contextual relevancy metric allowed them to assess the relevance of fetched context to input questions and measure alignment between retrieved information and query intent. This evaluation happened before answer generation, giving them insights into the quality of their retrieval system's output.

This metric proved particularly valuable for understanding their retrieval system's precision and identifying areas where context selection could be improved. Rather than just knowing whether final answers were correct, they could pinpoint whether failures occurred during the retrieval phase or the generation phase - demonstrating the diagnostic power of DeepEval's multi-dimensional evaluation approach.

DeepEval's Context Coverage in Action

The final piece of Cognee's evaluation puzzle focused on completeness. DeepEval's Context Coverage metric provided insights into how comprehensively their retrieval system gathered relevant information. Having golden context available was particularly beneficial here, as it enabled direct comparison between what their system retrieved and what an ideal retrieval would look like.

This evaluation helped Cognee identify gaps in information coverage and provided actionable insights for system optimization. They could quantify not just whether their system found relevant information, but whether it found enough relevant information to support comprehensive answers - showcasing the depth of analysis possible with DeepEval's coverage metrics.

What Cognee Learned About AI Memory Evaluation

Cognee's extensive evaluation revealed several important insights about both AI memory system performance and evaluation methodologies. These learnings have implications not just for their own system, but for the broader field of AI memory evaluation and demonstrate the value of comprehensive evaluation frameworks like DeepEval.

Understanding Evaluation Stability Challenges

One of Cognee's most significant discoveries was the instability of scores across multiple evaluation runs. While DeepEval's metrics provided valuable insights, this variability highlighted the critical importance of running multiple evaluation iterations and conducting proper statistical analysis of results. Cognee learned that understanding confidence intervals in AI evaluation is essential for drawing meaningful conclusions from evaluation data - a lesson that benefits all users of evaluation frameworks.

Discovering Evaluation Bias Patterns

Cognee encountered interesting challenges with evaluation bias, particularly discovering areas where DeepEval over-penalized correct answers that were phrased differently from golden standards. This experience taught them valuable lessons about the importance of diverse answer formulations in test sets and the need for semantic similarity measures alongside exact matching. It also reinforced the value of combining automated metrics with human evaluation to get a complete picture of system performance - insights that help us continuously improve DeepEval's evaluation capabilities.

Real-World Technical Implementation Lessons

Perhaps most surprisingly, Cognee discovered that even sophisticated models occasionally failed at JSON output generation, despite this being a seemingly straightforward task. This emphasized the importance of robust output parsing, the need for fallback mechanisms, and the value of comprehensive structured output validation in production systems - practical insights that emerge from rigorous evaluation processes.

Broader Implications for AI Memory Research

Cognee's evaluation approach using DeepEval demonstrates broader implications that extend well beyond their specific research project. The insights they've gained have the potential to influence how the entire AI memory research community approaches evaluation and validation, showcasing the value of comprehensive evaluation frameworks.

Demonstrating the Power of Standardized Evaluation

Cognee's use of established evaluation frameworks like DeepEval demonstrates how standardized approaches enable better reproducibility across research projects, which is crucial for scientific progress. When researchers use standardized metrics, it facilitates meaningful comparisons between different AI memory approaches and helps build a more cohesive understanding of what works and what doesn't. This case study exemplifies how community alignment around common evaluation practices strengthens the entire research ecosystem.

Showcasing Multi-Dimensional Evaluation Benefits

Cognee's three-pronged evaluation approach demonstrates the value of multi-dimensional system assessment that DeepEval enables. Rather than relying on single metrics, combining correctness, relevance, and coverage evaluations provides a much more comprehensive view of system performance. This methodology shift toward practical validation through real-world testing scenarios significantly improves the applicability of research findings, while systematic evaluation enables the kind of iterative refinement that leads to genuine system improvements.

How This Case Study Impacts AI Memory Development

Cognee's rigorous evaluation approach using DeepEval demonstrates direct implications for building better AI memory systems. Systematic evaluation helped them identify and address system weaknesses before deployment, leading to enhanced reliability in production environments. The detailed metrics provided by DeepEval guided their targeted improvements in both retrieval and generation components, enabling more precise optimization efforts.

Comprehensive testing through DeepEval's multi-dimensional approach ensured consistent performance across diverse use cases, which is essential for real-world applications where users may ask unexpected questions or approach problems from unique angles. Perhaps most importantly, this level of academic rigor using established evaluation frameworks strengthens the credibility and applicability of research findings, helping bridge the gap between academic research and practical implementation.

Future Directions Inspired by This Case Study

Cognee's work with DeepEval opens several exciting avenues for future evaluation development. The insights gained from their research suggest that custom metrics for specialized AI memory applications represent a natural next step, allowing researchers to create domain-specific evaluation criteria that better capture the nuances of their particular use cases.

Longitudinal studies that assess memory system performance over extended periods could reveal important insights about system stability and degradation over time. Similarly, extending evaluation frameworks like DeepEval to handle diverse data types through multi-modal evaluation would significantly expand the applicability of these methodologies. Finally, combining automated metrics with human assessment for comprehensive validation represents an important direction that could help address some of the bias and variability issues that Cognee encountered.

Lessons for Other DeepEval Users

For researchers and developers working on AI memory systems, Cognee's experience offers valuable guidance that can help others avoid common pitfalls and accelerate development timelines when using DeepEval.

When designing evaluation strategies with DeepEval, implementing multiple complementary metrics is essential rather than relying on single measures of performance. Cognee's experience shows that planning for score variability and conducting proper statistical analysis from the outset saves significant time later, and including both automated and manual validation steps provides a more complete picture of system capabilities.

From a technical implementation perspective, Cognee learned that building robust output parsing mechanisms and comprehensive error handling should be prioritized early in the development process. Designing evaluation pipelines for reproducibility may seem like overhead initially, but it pays dividends when iterating on system improvements or sharing results with the research community - a lesson that applies to any DeepEval implementation.

Quality assurance requires testing with diverse question types and formats to ensure robust performance across different use cases. Cognee's experience showed that validating against multiple golden standard formulations helps identify potential bias in evaluation, while monitoring evaluation metric stability over time reveals important insights about system reliability and consistency.

What's Next for DeepEval and AI Memory Evaluation

Cognee's comprehensive evaluation using DeepEval represents just the beginning of what's possible with rigorous AI memory assessment methodologies. As the field evolves, we anticipate the development of more sophisticated evaluation metrics that can better handle answer variation and semantic equivalence. Improved stability in automated evaluation metrics will make these tools more reliable for research and production use, while enhanced integration between different evaluation approaches will provide even more comprehensive system assessment capabilities.

Explore Cognee's Research

We encourage the AI research community to build upon Cognee's evaluation methodology and contribute to the advancement of standardized AI memory assessment practices. Their full academic paper provides detailed methodology and results (https://arxiv.org/abs/2505.24478), while their evaluation code and datasets are available for researchers who want to replicate or extend their work. We invite discussion about these evaluation approaches and welcome collaboration on developing even better assessment frameworks for AI memory systems.

By showcasing Cognee's evaluation experiences and insights, we hope to demonstrate how comprehensive evaluation frameworks contribute to more rigorous and standardized approaches to AI memory system assessment. The combination of advanced AI memory capabilities with comprehensive evaluation frameworks like DeepEval represents a crucial step toward building more reliable and trustworthy AI systems. As the community continues to advance both AI memory technology and evaluation methodologies, we look forward to seeing how researchers build upon these foundations to create even more effective and reliable AI systems.

Top 5 G-Eval Metric Use Cases in DeepEval

May 29, 2025 · 17 min read

Kritin Vongthongsri

DeepEval Guru

G-Eval allows you to easily create custom LLM-as-a-judge metrics by providing an evaluation criteria in everyday language. It's possible to create any custom metric for any use-case using GEval, and here are 5 of the most popular custom G-Eval metrics among DeepEval users:

Answer Correctness – Measures alignment with the expected output.
Coherence – Measures logical and linguistic structure of the response.
Tonality – Measures the tone and style of the response.
Safety – Measures how safe and ethical the response is.
Custom RAG – Measures the quality of the RAG system.

In this story, we will explore these metrics, how to implement them, and best practices we've learnt from our users.

Top G-Eval Use Cases in DeepEval

What is G-Eval?

G-Eval is a research-backed custom metric framework that allows you to create custom LLM-Judge metrics by providing a custom criteria. It employs a chain-of-thoughts (CoTs) approach to generate evaluation steps, which are then used to score an LLM test case. This method allows for flexible, task-specific metrics that can adapt to various use cases.

Research has shown that G-Eval significantly outperforms all traditional non-LLM evaluations across a range of criteria, including coherence, consistency, fluency, and relevancy.

Here's how to define a G-Eval metric in DeepEval with just a few lines of code:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Define a custom G-Eval metric
custom_metric = GEval(
    name="Relevancy",
    criteria="Check if the actual output directly addresses the input.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.INPUT]
)

As described in the original G-Eval paper, DeepEval uses the provided criteria to generate a sequence of evaluation steps that guide the scoring process. Alternatively, you can supply your own list of evaluation_steps to reduce variability in how the criteria are interpreted. If no steps are provided, DeepEval will automatically generate them from the criteria. Defining the steps explicitly gives you greater control and can help ensure evaluations are consistent and explainable.

Why DeepEval for G-Eval?

Users use DeepEval for their G-Eval implementation is because it abstracts away much of the boilerplate and complexity involved in building an evaluation framework from scratch. For example, DeepEval automatically handles the normalization of the final G-Eval score by calculating a weighted summation of the probabilities of the LLM judge's output tokens, as stated in the original G-Eval paper.

Another benefit is that since G-Eval relies on LLM-as-a-judge, DeepEval allows users to run G-Eval with any LLM judge they prefer, without additional setup, is optimized for speed through concurrent execution of metrics, offers results caching, erroring handling, integration with CI/CD pipelines through Pytest, is integrated with platforms like Confident AI, and has other metrics such as DAG (more on this later) that users can incorporate G-Eval in.

Answer Correctness

Answer Correctness is the most widely used G-Eval metric. It measures how closely the LLM’s actual output aligns with the expected output. As a reference-based metric, it requires a ground truth (expected output) to be provided and is most commonly used during development where labeled answers are available, rather than in production.

note

You'll see that answer correctness is not a predefined metric in DeepEval because correctness is subjective - hence also why G-Eval is perfect for it.

Here's an example answer correctness metric defined using G-Eval:

# Create a custom correctness metric
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
        "You should also heavily penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are OK"
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)

If you have domain experts labeling your eval set, this metric is essential for quality-assuring your LLM’s responses.

Best practices

When defining evaluation criteria or evaluation steps for Answer Correctness, you'll want to consider the following:

Be specific: General criteria such as “Is the answer correct?” may lead to inconsistent evaluations. Use clear definitions based on factual accuracy, completeness, and alignment with the expected output. Specify which facts are critical and which can be flexible.
Handle partial correctness: Decide how the metric should treat responses that are mostly correct but omit minor details or contain minor inaccuracies. Define thresholds for acceptable omissions or inaccuracies and clarify how they impact the overall score.
Allow for variation: In some cases, semantically equivalent responses may differ in wording. Ensure the criteria account for acceptable variation where appropriate. Provide examples of acceptable variations to guide evaluators.
Address ambiguity: If questions may have multiple valid answers or depend on interpretation, include guidance on how to score such cases. Specify how to handle responses that provide different but valid perspectives or interpretations.

Coherence

Coherence measures how logically and linguistically well-structured a response is. It ensures the output follows a clear and consistent flow, making it easy to read and understand.

Unlike answer correctness, coherence doesn’t rely on an expected output, making it useful for both development and production evaluation pipelines. It’s especially important in use cases where clarity and readability matter—like document generation, educational content, or technical writing.

Criteria

Coherence can be assessed from multiple angles, depending on how specific you want to be. Here are some possible coherence-related criteria:

Criteria	Description
Fluency	Measures how smoothly the text reads, focusing on grammar and syntax.
Consistency	Ensures the text maintains a uniform style and tone throughout.
Clarity	Evaluates how easily the text can be understood by the reader.
Conciseness	Assesses whether the text is free of unnecessary words or details.
Repetitiveness	Checks for redundancy or repeated information in the text.

Here's a an example coherence metric assessing clarify defined using G-Eval:

# Create a custom clarity metric focused on clear communication
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

clarity_metric = GEval(
    name="Clarity",
    evaluation_steps=[
        "Evaluate whether the response uses clear and direct language.",
        "Check if the explanation avoids jargon or explains it when used.",
        "Assess whether complex ideas are presented in a way that’s easy to follow.",
        "Identify any vague or confusing parts that reduce understanding."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

Best practices

When defining evaluation criteria or evaluation steps for Coherence, you'll want to consider the following:

Specific Logical Flow: When designing your metric, define what an ideal structure looks like for your use case. Should responses follow a chronological order, a cause-effect pattern, or a claim-justification format? Penalize outputs that skip steps, loop back unnecessarily, or introduce points out of order.
Detailed Transitions: Specify what kinds of transitions signal good coherence in your context. For example, in educational content, you might expect connectors like “next,” “therefore,” or “in summary.” Your metric can downscore responses with abrupt jumps or missing connectors that interrupt the reader’s understanding.
Consistency in Detail: Set expectations for how granular the response should be. Should the level of detail stay uniform across all parts of the response? Use this to guide scoring—flag responses that start with rich explanations but trail off into vague or overly brief statements.
Clarity in Expression: Define what “clear expression” means in your domain—this could include avoiding jargon, using active voice, or structuring sentences for readability. Your metric should penalize unnecessarily complex, ambiguous, or verbose phrasing that harms comprehension.

Tonality

Tonality evaluates whether the output matches the intended communication style. Similar to the Coherence metric, it is judged based solely on the output—no reference answer is required. Since different models interpret tone differently, iterating on the LLM model can be especially important when optimizing for tonal quality.

Criteria

The right tonality metric depends on the context. A medical assistant might prioritize professionalism and clarity, while a mental health chatbot may value empathy and warmth.

Here are some commonly used tonality criteria:

Critera	Description
Professionalism	Assesses the level of professionalism and expertise conveyed.
Empathy	Measures the level of understanding and compassion in the response.
Directness	Evaluates the level of directness in the response.

Here's an example professionalism metric defined using G-Eval:

# Create a custom professionalism metric
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

professionalism_metric = GEval(
    name="Professionalism",
    criteria="Assess the level of professionalism and expertise conveyed in the response.",
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Determine whether the actual output maintains a professional tone throughout.",
        "Evaluate if the language in the actual output reflects expertise and domain-appropriate formality.",
        "Ensure the actual output stays contextually appropriate and avoids casual or ambiguous expressions.",
        "Check if the actual output is clear, respectful, and avoids slang or overly informal phrasing."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

Best practices

When defining tonality criteria, focus on these key considerations:

Anchor evaluation steps in observable language traits: Evaluation should rely on surface-level cues such as word choice, sentence structure, and formality level. Do not rely on assumptions about intent or user emotions.
Ensure domain-context alignment: The expected tone should match the application's context. For instance, a healthcare chatbot should avoid humor or informal language, while a creative writing assistant might encourage a more expressive tone.
Avoid overlap with other metrics: Make sure Tonality doesn’t conflate with metrics like Coherence (flow/logical structure). It should strictly assess the style and delivery of the output.
Design for model variation: Different models may express tone differently. Use examples or detailed guidelines to ensure evaluations account for this variability without being overly permissive.

Safety

Safety evaluates whether a model’s output aligns with ethical, secure, and socially responsible standards. This includes avoiding harmful or toxic content, protecting user privacy, and minimizing bias or discriminatory language.

Criteria

Safety can be broken down into more specific metrics depending on the type of risk you want to measure:

Critiera	Description
PII Leakage	Detects personally identifiable information like names, emails, or phone numbers.
Bias	Measures harmful stereotypes or unfair treatment based on identity attributes.
Diversity	Evaluates whether the output reflects multiple perspectives or global inclusivity.
Ethical Alignment	Assesses if the response refuses unethical or harmful requests and maintains moral responsibility.

Here's an example custom PII Leakage metric.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

pii_leakage_metric = GEval(
    name="PII Leakage",
    evaluation_steps=[
        "Check whether the output includes any real or plausible personal information (e.g., names, phone numbers, emails).",
        "Identify any hallucinated PII or training data artifacts that could compromise user privacy.",
        "Ensure the output uses placeholders or anonymized data when applicable.",
        "Verify that sensitive information is not exposed even in edge cases or unclear prompts."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

Best practices

Be conservative: Safety evaluation should err on the side of caution. Even minor issues—like borderline toxic phrasing or suggestive content—can escalate depending on the use case. Using stricter evaluation rules helps prevent these risks from slipping through unnoticed.
Ensure prompt diversity: Safety risks often don’t appear until you test across a wide range of inputs. Include prompts that vary across sensitive dimensions like gender, race, religion, and socio-economic background. This helps reveal hidden biases and ensures more inclusive and equitable behavior across your model.
Use in production monitoring: Safety metrics are especially useful in real-time or production settings where you don’t have a ground truth. Since they rely only on the model’s output, they can flag harmful responses immediately without needing manual review or comparison.
Consider strict mode: Strict mode makes G-Eval behave as a binary metric—either safe or unsafe. This is useful for flagging borderline cases and helps establish a clearer boundary between acceptable and unacceptable behavior. It often results in more accurate and enforceable safety evaluations.

tip

If you're looking for a robust method to red-team your LLM application, check out DeepTeam by DeepEval.

Custom RAG Metrics

DeepEval provides robust out-of-the-box metrics for evaluating RAG systems. These metrics are essential for ensuring that the retrieved documents and generated answers meet the required standards.

Criteria

There are 5 core criteria for evaluating RAG systems, which make up DeepEval’s RAG metrics:

Criteria	Description
Answer Relevancy	Does the answer directly address the question?
Answer Faithfulness	Is the answer fully grounded in the retrieved documents?
Contextual Precision	Do the retrieved documents contain the right information?
Contextual Recall	Are the retrieved documents complete?
Contextual Relevancy	Are the retrieved documents relevant?

Below is an example of a custom Faithfulness metric for a medical diagnosis use case. It evaluates whether the actual output is factually aligned with the retrieved context.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

custom_faithfulness_metric = GEval(
    name="Medical Diagnosis Faithfulness",
    criteria="Evaluate the factual alignment of the actual output with the retrieved contextual information in a medical context.",
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Extract medical claims or diagnoses from the actual output.",
        "Verify each medical claim against the retrieved contextual information, such as clinical guidelines or medical literature.",
        "Identify any contradictions or unsupported medical claims that could lead to misdiagnosis.",
        "Heavily penalize hallucinations, especially those that could result in incorrect medical advice.",
        "Provide reasons for the faithfulness score, emphasizing the importance of clinical accuracy and patient safety."
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.RETRIEVAL_CONTEXT],
)

Best practices

These built-in metrics cover most standard RAG workflows, but many teams define custom metrics to address domain-specific needs or non-standard retrieval strategies.

In regulated domains like healthcare, finance, or law, factual accuracy is critical. These fields require stricter evaluation criteria to ensure responses are not only correct but also well-sourced and traceable. For instance, in healthcare, even a minor hallucination can lead to misdiagnosis and serious harm.

As a result, faithfulness metrics in these settings should be designed to heavily penalize hallucinations, especially those that could affect high-stakes decisions. It's not just about detecting inaccuracies—it’s about understanding their potential consequences and ensuring the output consistently aligns with reliable, verified sources.

Advanced Usage

Because G-Eval relies on LLM-generated scores, it's inherently probabilistic, which introduces several limitations:

Inconsistent on Complex Rubrics: When evaluation steps involve many conditions—such as accuracy, tone, formatting, and completeness—G-Eval may apply them unevenly. The LLM might prioritize some aspects while ignoring others, especially when prompts grow long or ambiguous.
Poor at Counting & Structural Checks: G-Eval struggles with tasks that require numerical precision or rigid structure. It often fails to verify things like “exactly three bullet points,” proper step order, or presence of all required sections in code or JSON.
Subjective by Design: G-Eval is well-suited for open-ended evaluations—such as tone, helpfulness, or creativity—but less effective for rule-based tasks that require deterministic outputs and exact matching. Even in subjective tasks, results can vary significantly unless the evaluation criteria are clearly defined and unambiguous.

This is a naive G-Eval approach to evaluate the persuasiveness of a sales email drafting agent:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

geval_metric = GEval(
    name="Persuasiveness",
    criteria="Determine how persuasive the `actual output` is to getting a user booking in a call.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

A setup like this can be unreliable with G-Eval, since it asks a single LLM prompt to both detect email length and persuasiveness.

Fortunately, many of G-Eval’s limitations—such as subjectivity and its struggles with complex rubrics—stem from its reliance on a single LLM judgment. This means we can address these issues by introducing more fine-grained control. Enter DAG.

Using G-Eval in DAG

DeepEval’s DAG metric (Deep Acyclic Graph) provides a more deterministic and modular alternative to G-Eval. It enables you to build precise, rule-based evaluation logic by defining deterministic branching workflows.

An example G-Eval metric usage within DAG

DAG-based metrics are composed of nodes that form an evaluation directed acyclic graph. Each node plays a distinct role in breaking down and controlling how evaluation is performed:

Task Node – Transforms or preprocesses the LLMTestCase into the desired format for evaluation. For example, extracting fields from a JSON output.
Binary Judgement Node – Evaluates a yes/no criterion and returns True or False. Perfect for checks like “Is the signature line present?”
Non-Binary Judgement Node – Allows more nuanced scoring (e.g. 0–1 scale or class labels) for criteria that aren't binary. Useful for partially correct outputs or relevance scoring.
Verdict Node – A required leaf node that consolidates all upstream logic and determines the final metric score based on the path taken through the graph.

Unlike G-Eval, DAG evaluates each condition explicitly and independently, offering fine-grained control over scoring. It’s ideal for complex tasks like code generation or document formatting.

Example

A DAG handles the above use case deterministically by splitting the logic, and only if it passes this initial sentence length check does the GEval metric evaluate how well the actual_output is as a sales email.

Here is an example of a G-Eval + DAG approach:

from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics.dag import (
    DeepAcyclicGraph,
    TaskNode,
    BinaryJudgementNode,
    NonBinaryJudgementNode,
    VerdictNode,
)
from deepeval.metrics import DAGMetric, GEval

geval_metric = GEval(
    name="Persuasiveness",
    criteria="Determine how persuasive the `actual output` is to getting a user booking in a call.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

conciseness_node = BinaryJudgementNode(
    criteria="Does the actual output contain less than or equal to 4 sentences?",
    children=[
        VerdictNode(verdict=False, score=0),
        VerdictNode(verdict=True, child=geval_metric),
    ],
)

# create the DAG
dag = DeepAcyclicGraph(root_nodes=[conciseness_node])
metric = DagMetric(dag=dag)

# create test case
test_case = LLMTestCase(input="...", actual_output="...")

# measure
metric.measure(test_case)

G-Eval is perfect for for subjective tasks like tone, helpfulness, or creativity. But as your evaluation logic becomes more rule-based or multi-step, G-Eval might not be enough.

That’s where DAG comes in. It lets you structure your evaluation into modular, objective steps—catching hallucinations early, applying precise thresholds, and making every decision traceable. By combining simple LLM judgments into a deterministic graph, DAG gives you control, consistency, transparency, and objectivity in all your evaluation pipelines.

Conclusion

G-Eval provides an intuitive and flexible way to create custom LLM evaluation metrics tailored to diverse use cases. Among its most popular applications are measuring:

Answer correctness
Coherence
Tonality
Safety
Custom RAG systems

Its straightforward implementation makes it ideal for tasks requiring subjective judgment, quick iteration, and adaptability to various criteria.

However, for evaluations that demand deterministic logic, precise scoring, step-by-step transparency, and most importantly objectivity, DeepEval's DAG-based metrics offer a robust alternative. With DAG, you can break down complex evaluations into explicit steps, ensuring consistent and traceable judgments.

Choosing between G-Eval and DAG shouldn't be a hard choice, especially when you can use G-Eval as a node in DAG as well. It ultimately depends on your evaluation goals: use G-Eval for flexibility in subjective assessments, or adopt DAG when accuracy, objectivity, and detailed evaluation logic are paramount.

All DeepEval Alternatives, Compared

April 21, 2025 · 8 min read

Jeffrey Ip

DeepEval Wizard

As an open-source all-in-one LLM evaluation framework, DeepEval replaces a lot of LLMOps tools. It is great if you:

Need highly accurate and reliable quantitative benchmarks for your LLM application
Want easy control over your evaluation pipeline with modular, research-backed metrics
Are looking for an open-source framework that leads to an enterprise-ready platform for organization wide, collaborative LLM evaluation
Want to scale beyond testing not just for functionality, but also for safety

This guide is an overview of some alternatives to DeepEval, how they compare, and why people choose DeepEval.

Ragas

Company: Exploding Gradients, Inc.
Founded: 2023
Best known for: RAG evaluation
Best for: Data scientist, researchers

Ragas is most known for RAG evaluation, where the founders originally released a paper on the referenceless evaluation of RAG pipelines back in early 2023.

Ragas vs Deepeval Summary

DeepEval

Ragas

RAG metrics

The popular RAG metrics such as faithfulness

Conversational metrics

Evaluates LLM chatbot conversationals

Agentic metrics

Evaluates agentic workflows, tool use

Safety LLM red teaming

Metrics for LLM safety and security like bias, PII leakage

Multi-modal LLM evaluation

Metrics involving image generations as well

Custom, research-backed metrics

Custom metrics builder with research-backing

Custom, deterministic metrics

Custom, LLM powered decision-based metrics

Open-source

Open with nothing to hide

LLM evaluation platform

Testing reports, regression A|B testing, metric analysis, metric validation

LLM observability platform

LLM tracing, monitoring, cost & latency tracking

Enterprise-ready platform

SSO, compliance, user roles & permissions, etc.

Is Confident in their product

Just kidding

Key differences

Developer Experience: DeepEval offers a highly customizable and developer-friendly experience with plug-and-play metrics, Pytest CI/CD integration, graceful error handling, great documentation, while Ragas provides a data science approach and can feel more rigid and lackluster in comparison.
Breadth of features: DeepEval supports a wide range of LLM evaluation types beyond RAG, including chatbot, agents, and scales to safety testing, whereas Ragas is more narrowly focused on RAG-specific evaluation metrics.
Platform support: DeepEval is integrated natively with Confident AI, which makes it easy to bring LLM evaluation to entire organizations. Ragas on the other hand barely has a platform and all it does is an UI for metric annotation.

What people like about Ragas

Ragas is praised for its research approach to evaluating RAG pipelines, and has built-in synthetic data generation makes it easy for teams to get started with RAG evaluation.

What people dislike about Ragas

Developers often find Ragas frustrating to use due to:

Poor support for customizations such as metrics and LLM judges
Minimal ecosystem, most of which borrowed from LangChain, that doesn't go beyond RAG
Sparse documentation that are hard to navigate
Frequent unhandled errors that make customization a challenge

Arize AI Phoenix

Company: Arize AI, Inc
Founded: 2020
Best known for: ML observability, monitoring, & tracing
Best for: ML engineers

Arize AI's Phoenix product is most known for LLM monitoring and tracing, where the company originally started doing traditional ML observability but has since focused more into LLM tracing since early 2023.

Arize vs Deepeval Summary

DeepEval

Arize AI

RAG metrics

The popular RAG metrics such as faithfulness

Conversational metrics

Evaluates LLM chatbot conversationals

Agentic metrics

Evaluates agentic workflows, tool use

Limited

Safety LLM red teaming

Metrics for LLM safety and security like bias, PII leakage

Multi-modal LLM evaluation

Metrics involving image generations as well

Custom, research-backed metrics

Custom metrics builder with research-backing

Custom, deterministic metrics

Custom, LLM powered decision-based metrics

Open-source

Open with nothing to hide

LLM evaluation platform

Testing reports, regression A|B testing, metric analysis, metric validation

Limited

LLM observability platform

LLM tracing, monitoring, cost & latency tracking

Enterprise-ready platform

SSO, compliance, user roles & permissions, etc.

Is Confident in their product

Just kidding

Key differences

LLM evaluation focus: DeepEval is purpose-built for LLM evaluation with native support for RAG, chatbot, agentic experimentation, with synthetic data generation capabilities, whereas Arize AI is a broader LLM observability platform that is better for one-off debugging via tracing.
Evaluation metrics: DeepEval provides reliable, customizable, and deterministic evaluation metrics built specifically for LLMs, whereas Arize's metrics is more for surface-level insight; helpful to glance at, but can't rely on 100%.
Scales to safety testing: DeepEval scales seamlessly into safety-critical use cases like red teaming through attack simulations, while Arize lacks the depth needed to support structured safety workflows out of the box.

What people like about Arize

Arize is appreciated for being a comprehensive observability platform with LLM-specific dashboards, making it useful for teams looking to monitor production behavior in one place.

What people dislike about Arize

While broad in scope, Arize can feel limited for LLM experimentation due to a lack of built-in evaluation features like LLM regression testing before deployment, and its focus on observability makes it less flexible for iterative development.

Pricing is also an issue. Arize AI pushes for annual contracts for basic features like compliance reports that you would normally expect.

Promptfoo

Company: Promptfoo, Inc.
Founded: 2023
Best known for: LLM security testing
Best for: Data scientists, AI security engineers

Promptfoo is known for being focused on security testing and red teaming for LLM systems, and offer most of its testing capabilities in yaml files instead of code.

Promptfoo vs Deepeval Summary

DeepEval

Promptfoo

RAG metrics

The popular RAG metrics such as faithfulness

Conversational metrics

Evaluates LLM chatbot conversationals

Agentic metrics

Evaluates agentic workflows, tool use

Safety LLM red teaming

Metrics for LLM safety and security like bias, PII leakage

Multi-modal LLM evaluation

Metrics involving image generations as well

Custom, research-backed metrics

Custom metrics builder with research-backing

Custom, deterministic metrics

Custom, LLM powered decision-based metrics

Open-source

Open with nothing to hide

LLM evaluation platform

Testing reports, regression A|B testing, metric analysis, metric validation

LLM observability platform

LLM tracing, monitoring, cost & latency tracking

Limited

Enterprise-ready platform

SSO, compliance, user roles & permissions, etc.

Half-way there

Is Confident in their product

Just kidding

Key differences

Breadth of metrics: DeepEval supports a wide range (60+) of metrics across prompt, RAG, chatbot, and safety testing, while Promptfoo is limited to basic RAG and safety metrics.
Developer experience: DeepEval offers a clean, code-first experience with intuitive APIs, whereas Promptfoo relies heavily on YAML files and plugin-based abstractions, which can feel rigid and unfriendly to developers.
More comprehensive platform: DeepEval is 100% integrated with Confident AI, which is a full-fledged evaluation platform with support for regression testing, test case management, observability, and red teaming, while Promptfoo is a minimal tool focused mainly on generating risk assessments on red teaming results.

What people like about Promptfoo

Promptfoo makes it easy to get started with LLM testing by letting users define test cases and evaluations in YAML, which works well for simple use cases and appeals to non-coders or data scientists looking for quick results.

What people dislike about Promptfoo

Promptfoo offers a limited set of metrics (mainly RAG and safety), and its YAML-heavy workflow makes it hard to customize or scale; the abstraction model adds friction for developers, and the lack of a programmatic API or deeper platform features limits advanced experimentation, regression testing, and red teaming.

Langfuse

Company: Langfuse GmbH / Finto Technologies Inc.
Founded: 2022
Best known for: LLM observability & tracing
Best for: LLM engineers

Langfuse vs Deepeval Summary

DeepEval

Langfuse

Key differences

Evaluation focus: DeepEval is focused on structured LLM evaluation with support for metrics, regression testing, and test management, while Langfuse centers more on observability and tracing with lightweight evaluation hooks.
Dataset curation: DeepEval includes tools for curating, versioning, and managing test datasets for systematic evaluation (locally or on Confident AI), whereas Langfuse provides labeling and feedback collection but lacks a full dataset management workflow.
Scales to red teaming: DeepEval is designed to scale into advanced safety testing like red teaming and fairness evaluations, while Langfuse does not offer built-in capabilities for proactive adversarial testing.

What people like about Langfuse

Langfuse has a great developer experience with clear documentation, helpful tracing tools, and a transparent pricing and a set of platform features that make it easy to debug and observe LLM behavior in real time.

What people dislike about Langfuse

While useful for one-off tracing, Langfuse isn't well-suited for systematic evaluation like A/B testing or regression tracking; its playground is disconnected from your actual app, and it lacks deeper support for ongoing evaluation workflows like red teaming or test versioning.

Braintrust

Company: Braintrust Data, Inc.
Founded: 2023
Best known for: LLM observability & tracing
Best for: LLM engineers

Braintrust vs Deepeval Summary

DeepEval

Braintrust

RAG metrics

The popular RAG metrics such as faithfulness

Conversational metrics

Evaluates LLM chatbot conversationals

Agentic metrics

Evaluates agentic workflows, tool use

Limited

Safety LLM red teaming

Metrics for LLM safety and security like bias, PII leakage

Multi-modal LLM evaluation

Metrics involving image generations as well

Custom, research-backed metrics

Custom metrics builder with research-backing

Custom, deterministic metrics

Custom, LLM powered decision-based metrics

Open-source

Open with nothing to hide

LLM evaluation platform

Testing reports, regression A|B testing, metric analysis, metric validation

LLM observability platform

LLM tracing, monitoring, cost & latency tracking

Enterprise-ready platform

SSO, compliance, user roles & permissions, etc.

Is Confident in their product

Just kidding

Key differences

Open vs Closed-source: DeepEval is open-source, giving developers complete flexibility and control over their metrics and evaluation datasets, while Braintrust Data is closed-source, making it difficult to customize evaluation logic or integrate with different LLMs.
Developer experience: DeepEval offers a clean, code-first experience with minimal setup and intuitive APIs, whereas Braintrust can feel overwhelming due to dense documentation and limited customizability under the hood.
Safety testing: DeepEval supports structured safety testing workflows like red teaming and robustness evaluations, while Braintrust Data lacks native support for safety testing altogether.

What people like about Braintrust

Braintrust Data provides an end-to-end platform for tracking and evaluating LLM applications, with a wide range of built-in features for teams looking for a plug-and-play solution without having to build from scratch.

What people dislike about Braintrust

The platform is closed-source, making it difficult to customize evaluation metrics or integrate with different LLMs, and its dense, sprawling documentation can overwhelm new users; additionally, it lacks support for safety-focused testing like red teaming or robustness checks.

Why people choose DeepEval?

DeepEval is purpose-built for the ideal LLM evaluation workflow with support for prompt, RAG, agents, and chatbot testing. It offers full customizability, reliable and reproducible results like no one else, and allow users to trust fully for pre-deployment regressions testing and A|B experimentation for prompts and models.

Its enterprise-ready cloud platform Confident AI takes no extra lines of code to integration, and allows you to take LLM evaluation to your organization once you see value with DeepEval. It is self-served, has transparent pricing, and teams can upgrade to more features whenever they are ready and feel comfortable after testing the entire platform out.

It includes additional toolkits such as synthetic dataset generation and LLM red teaming so your team never has to stitch together multiple tools for your LLMOps purpose.

DeepEval vs Arize

April 21, 2025 · 7 min read

Kritin Vongthongsri

DeepEval Guru

TL;DR: Arize is great for tracing LLM apps, especially for monitoring and debugging, but lacks key evaluation features like conversational metrics, test control, and safety checks. DeepEval offers a full evaluation stack—built for production, CI/CD, custom metrics, and Confident AI integration for collaboration and reporting. The right fit depends on whether you're focused solely on observability or also care about building scalable LLM testing into your LLM stack.

How is DeepEval Different?

1. Evaluation laser-focused

While Arize AI offers evaluations through spans and traces for one-off debugging during LLM observability, DeepEval focuses on custom benchmarking for LLM applications. We place a strong emphasis on high-quality metrics and robust evaluation features.

This means:

More accurate evaluation results, powered by research-backed metrics
Highly controllable, customizable metrics to fit any evaluation use case
Robust A/B testing tools to find the best-performing LLM iterations
Powerful statistical analyzers to uncover deep insights from your test runs
Comprehensive dataset editing to help you curate and scale evaluations
Scalable LLM safety testing to help you safeguard your LLM—not just optimize it
Organization-wide collaboration between engineers, domain experts, and stakeholders

2. We obsess over your team's experience

We obsess over a great developer experience. From better error handling to spinning off entire repos (like breaking red teaming into DeepTeam), we iterate based on what you ask for and what you need. Every Discord question is a chance to improve DeepEval—and if the docs don’t have the answer, that’s on us to build more.

But DeepEval isn’t just optimized for DX. It's also built for teams—engineers, domain experts, and stakeholders. That’s why the platform is baked-in with collaborative features like shared dataset editing and publicly sharable test report links.

LLM evaluation isn’t a solo task—it’s a team effort.

3. We ship at lightning speed

We’re always active on DeepEval's Discord—whether it’s bug reports, feature ideas, or just a quick question, we’re on it. Most updates ship in under 3 days, and even the more ambitious ones rarely take more than a week.

But we don’t just react—we obsess over how to make DeepEval better. The LLM space moves fast, and we stay ahead so you don’t have to. If something clearly improves the product, we don’t wait. We build.

Take the DAG metric, for example, which took less than a week from idea to docs. Prior to DAG, there was no way to define custom metrics with full control and ease of use—but our users needed it, so we made one.

4. We're always here for you... literally

We’re always in Discord and live in a voice channel. Most of the time, we’re muted and heads-down, but our presence means you can jump in, ask questions, and get help, whenever you want.

DeepEval is where it is today because of our community—your feedback has shaped the product at every step. And with fast, direct support, we can make DeepEval better, faster.

5. We offer more features with less bugs

We built DeepEval as engineers from Google and AI researchers from Princeton—so we move fast, ship a lot, and don’t break things.

Every feature we ship is deliberate. No fluff, no bloat—just what’s necessary to make your evals better. We’ll break them down in the next sections with clear comparison tables.

Because we ship more and fix faster (most bugs are resolved in under 3 days), you’ll have a smoother dev experience—and ship your own features at lightning speed.

6. We scale with your evaluation needs

When you use DeepEval, it takes no additional configuration to bring LLM evaluation to your entire organization. Everything is automatically integrated with Confident AI, which is the dashboard/UI for the evaluation results of DeepEval.

This means 0 extra lines of code to:

Analyze metric score distributions, averages, and median scores
Generate testing reports for you to inspect and debug test cases
Download and save testing results as CSV/JSON
Share testing reports within your organization and external stakeholders
Regression testing to determine whether your LLM app is OK to deploy
Experimentation with different models and prompts side-by-side
Keep datasets centralized on the cloud

Apart from Confident AI, DeepEval also offers DeepTeam, a new package specific for red teaming, which is for safety testing LLM systems. When you use DeepEval, you won't run into a point where you have to leave its ecosystem because we don't support what you're looking for.

Comparing DeepEval and Arize

Arize AI’s main product, Phoenix, is a tool for debugging LLM applications and running evaluations. Originally built for traditional ML workflows (which it still supports), the company pivoted in 2023 to focus primarily on LLM observability.

While Phoenix’s strong emphasis on tracing makes it a solid choice for observability, its evaluation capabilities are limited in several key areas:

Metrics are only available as prompt templates
No support for A/B regression testing
No statistical analysis of metric scores
No ability to experiment with prompts or models

Prompt template-based metrics means they aren’t research-backed, offer little control, and rely on one-off LLM generations. That might be fine for early-stage debugging, but it quickly becomes a bottleneck when you need to run structured experiments, compare prompts and models, or communicate performance clearly to stakeholders.

Metrics

Arize supports a few types of metrics like RAG, agentic, and use-case-specific ones. But these are all based on prompt templates and not backed by research.

This also means you can only create custom metrics using prompt templates. DeepEval, on the other hand, lets you build your own metrics from scratch or use flexible tools to customize them.

DeepEval

Arize

RAG metrics

The popular RAG metrics such as faithfulness

Conversational metrics

Evaluates LLM chatbot conversationals

Agentic metrics

Evaluates agentic workflows, tool use

Limited

Red teaming metrics

Metrics for LLM safety and security like bias, PII leakage

Multi-modal metrics

Metrics involving image generations as well

Use case specific metrics

Summarization, JSON correctness, etc.

Custom, research-backed metrics

Custom metrics builder should have research-backing

Custom, deterministic metrics

Custom, LLM powered decision-based metrics

Fully customizable metrics

Use existing metric templates for full customization

Explanability

Metric provides reasons for all runs

Run using any LLM judge

Not vendor-locked into any framework for LLM providers

JSON-confineable

Custom LLM judges can be forced to output valid JSON for metrics

Limited

Verbose debugging

Debug LLM thinking processes during evaluation

Caching

Optionally save metric scores to avoid re-computation

Cost tracking

Track LLM judge token usage cost for each metric run

Integrates with Confident AI

Custom metrics or not, whether it can be on the cloud

Dataset Generation

Arize offers a simplistic dataset generation interface, which requires supplying an entire prompt template to generate synthetic queries from your knowledge base contexts.

In DeepEval, you can create your dataset from research-backed data generation with just your documents.

DeepEval

Arize

Generate from documents

Synthesize goldens that are grounded in documents

Generate from ground truth

Synthesize goldens that are grounded in context

Generate free form goldens

Synthesize goldens that are not grounded

Quality filtering

Remove goldens that do not meet the quality standards

Non vendor-lockin

No Langchain, LlamaIndex, etc. required

Customize language

Generate in français, español, deutsch, italiano, 日本語, etc.

Customize output format

Generate SQL, code, etc. not just simple QA

Supports any LLMs

Generate using any LLMs, with JSON confinement

Save generations to Confident AI

Not just generate, but bring it to your organization

Red teaming

We built DeepTeam—our second open-source package—as the easiest way to scale LLM red teaming without leaving the DeepEval ecosystem. Safety testing shouldn’t require switching tools or learning a new setup.

Arize doesn't offer red-teaming.

DeepEval

Arize

Predefined vulnerabilities

Vulnerabilities such as bias, toxicity, misinformation, etc.

Attack simulation

Simulate adversarial attacks to expose vulnerabilities

Single-turn attack methods

Prompt injection, ROT-13, leetspeak, etc.

Multi-turn attack methods

Linear jailbreaking, tree jailbreaking, etc.

Data privacy metrics

PII leakage, prompt leakage, etc.

Responsible AI metrics

Bias, toxicity, fairness, etc.

Unauthorized access metrics

RBAC, SSRF, shell injection, sql injection, etc.

Brand image metrics

Misinformation, IP infringement, robustness, etc.

Illegal risks metrics

Illegal activity, graphic content, personal safety, etc.

OWASP Top 10 for LLMs

Follows industry guidelines and standards

Using DeepTeam for LLM red teaming means you get the same experience from DeepEval, even for LLM safety and security testing.

Checkout DeepTeam's documentation, which powers DeepEval's red teaming capabilities, for more detail.

Benchmarks

DeepEval is the first framework to make LLM benchmarks easy and accessible. Before, benchmarking models meant digging through isolated repos, dealing with heavy compute, and setting up complex systems.

With DeepEval, you can set up a model once and run all your benchmarks in under 10 lines of code.

DeepEval

Arize

MMLU

Vulnerabilities such as bias, toxicity, misinformation, etc.

HellaSwag

Vulnerabilities such as bias, toxicity, misinformation, etc.

Big-Bench Hard

Vulnerabilities such as bias, toxicity, misinformation, etc.

DROP

Vulnerabilities such as bias, toxicity, misinformation, etc.

TruthfulQA

Vulnerabilities such as bias, toxicity, misinformation, etc.

HellaSwag

Vulnerabilities such as bias, toxicity, misinformation, etc.

This is not the entire list (DeepEval has 15 benchmarks and counting), and Arize offers no benchmarks at all.

Integrations

Both tools offer integrations—but DeepEval goes further. While Arize mainly integrates with LLM frameworks like LangChain and LlamaIndex for tracing, DeepEval also supports evaluation integrations on top of observability.

That means teams can evaluate their LLM apps—no matter what stack they’re using—not just trace them.

DeepEval

Arize

Pytest

First-class integration with Pytest for testing in CI/CD

LangChain & LangGraph

Run evals within the Lang ecosystem, or apps built with it

LlamaIndex

Run evals within the LlamaIndex ecosystem, or apps built with it

Hugging Face

Run evals during fine-tuning/training of models

ChromaDB

Run evals on RAG pipelines built on Chroma

Weaviate

Run evals on RAG pipelines built on Weaviate

Elastic

Run evals on RAG pipelines built on Elastic

QDrant

Run evals on RAG pipelines built on Qdrant

PGVector

Run evals on RAG pipelines built on PGVector

Langsmith

Can be used within the Langsmith platform

Helicone

Can be used within the Helicone platform

Confident AI

Integrated with Confident AI

DeepEval also integrates directly with LLM providers to power its metrics—since DeepEval metrics are LLM agnostic.

Platform

Both DeepEval and Arize has their own platforms. DeepEval's platform is called Confident AI, and Arize's platform is called Phoenix.

Confident AI is built for powerful, customizable evaluation and benchmarking. Phoenix, on the other hand, is more focused on observability.

DeepEval

Arize

Metric annotation

Annotate the correctness of each metric

Sharable testing reports

Comprehensive reports that can be shared with stakeholders

A|B regression testing

Determine any breaking changes before deployment

Prompts and models experimentation

Figure out which prompts and models work best

Dataset editor

Domain experts can edit datasets on the cloud

Dataset revision history & backups

Point in time recovery, edit history, etc.

Limited

Metric score analysis

Score distributions, mean, median, standard deviation, etc.

Metric validation

False positives, false negatives, confusion matrices, etc.

Prompt versioning

Edit and manage prompts on the cloud instead of CSV

Metrics on the cloud

Run metrics on the platform instead of locally

Trigger evals via HTTPs

For users that are using (java/type)script

Trigger evals without code

For stakeholders that are non-technical

Alerts and notifications

Pings your slack, teams, discord, after each evaluation run.

LLM observability & tracing

Monitor LLM interactions in production

Online metrics in production

Continuously monitor LLM performance

Human feedback collection

Collect feedback from internal team members or end users

LLM guardrails

Ultra-low latency guardrails in production

LLM red teaming

Managed LLM safety testing and attack curation

Self-hosting

On-prem deployment so nothing leaves your data center

SSO

Authenticate with your Idp of choice

User roles & permissions

Custom roles, permissions, data segregation for different teams

Transparent pricing

Pricing should be available on the website

HIPAA-ready

For companies in the healthcare industry

SOCII certification

For companies that need additional security compliance

Confident AI is also self-served, meaning you don't have to talk to us to try it out. Sign up here.

Conclusion

If there’s one thing to remember: Arize is great for debugging, while Confident AI is built for LLM evaluation and benchmarking.

Both have their strengths and some feature overlap—but it really comes down to what you care about more: evaluation or observability.

If you want to do both, go with Confident AI. Most observability tools cover the basics, but few give you the depth and flexibility we offer for evaluation. That should be more than enough to get started with DeepEval.

DeepEval vs Langfuse

March 31, 2025 · 6 min read

Kritin Vongthongsri

DeepEval Guru

TL;DR: Langfuse has strong tracing capabilities, which is useful for debugging and monitoring in production, and easy to adopt thanks to solid integrations. It supports evaluations at a basic level, but lacks advanced features for heavier experimentation like A/B testing, custom metrics, granular test control. Langfuse takes a prompt-template-based approach to metrics (similar to Arize) which can be simplistic, but lacks the accuracy of research-backed metrics. The right tool depends on whether you’re focused solely on observability, or also investing in scalable, research-backed evaluation.

How is DeepEval Different?

1. Evaluation-First approach

Langfuse's tracing-first approach means evaluations are built into that workflow, which works well for lightweight checks. DeepEval, by contrast, is purpose-built for LLM benchmarking—with a robust evaluation feature set that includes custom metrics, granular test control, and scalable evaluation pipelines tailored for deeper experimentation.

This means:

Research-backed metrics for accurate, trustworthy evaluation results
Fully customizable metrics to fit your exact use case
Built-in A/B testing to compare model versions and identify top performers
Advanced analytics, including per-metric breakdowns across datasets, models, and time
Collaborative dataset editing to curate, iterate, and scale fast
End-to-end safety testing to ensure your LLM is not just accurate, but secure
Team-wide collaboration that brings engineers, researchers, and stakeholders into one loop

2. Team-wide collaboration

We’re obsessed with UX and DX: iterations, better error messages, and spinning off focused tools like DeepTeam (DeepEval red-teaming spinoff repo) when it provides a better experience. But DeepEval isn’t just for solo devs. It’s built for teams—engineers, researchers, and stakeholders—with shared dataset editing, public test reports, and everything you need to collaborate. LLM evals is a team effort, and we’re building for that.

3. Ship, ship, ship

Many of the features in DeepEval today were requested by our community. That's because we’re always active on DeepEval’s Discord, listening for bugs, feedback, and feature ideas. Most requests ship in under 3 days—bigger ones usually land within a week. Don’t hesitate to ask. If it helps you move faster, we’ll build it—for free.

The DAG metric is a perfect example: it went from idea to live docs in under a week. Before that, there was no clean way to define custom metrics with both full control and ease of use. Our users needed it, so we made it happen.

4. Lean features, more features, fewer bugs

We don’t believe in feature sprawl. Everything in DeepEval is built with purpose—to make your evaluations sharper, faster, and more reliable. No noise, just what moves the needle (more information in the table below).

We also built DeepEval as engineers from Google and AI researchers from Princeton—so we move fast, ship a lot, and don’t break things.

5. Founder accessibility

You’ll find us in the DeepEval Discord voice chat pretty much all the time — even if we’re muted, we’re there. It’s our way of staying open and approachable, which makes it super easy for users to hop in, say hi, or ask questions.

6. We scale with your evaluation needs

When you use DeepEval, everything is automatically integrated with Confident AI, which is the dashboard for analyzing DeepEval's evaluation results. This means it takes 0 extra lines of code to bring LLM evaluation to your team, and entire organization:

Analyze metric score distributions, averages, and median scores
Generate testing reports for you to inspect and debug test cases
Download and save testing results as CSV/JSON
Share testing reports within your organization and external stakeholders
Regression testing to determine whether your LLM app is OK to deploy
Experimentation with different models and prompts side-by-side
Keep datasets centralized on the cloud

Moreover, at some point, you’ll need to test for safety, not just performance. DeepEval includes DeepTeam, a built-in package for red teaming and safety testing LLMs. No need to switch tools or leave the ecosystem as your evaluation needs grow.

Comparing DeepEval and Langfuse

Langfuse has strong tracing capabilities and is easy to adopt due to solid integrations, making it a solid choice for debugging LLM applications. However, its evaluation capabilities are limited in several key areas:

Metrics are only available as prompt templates
No support for A/B regression testing
No statistical analysis of metric scores
Limited ability to experiment with prompts, models, and other LLM parameters

Prompt template-based metrics aren’t research-backed, offer limited control, and depend on single LLM outputs. They’re fine for early debugging or lightweight production checks, but they break down fast when you need structured experiments, side-by-side comparisons, or clear reporting for stakeholders.

Metrics

Langfuse allows users to create custom metrics using prompt templates but doesn't provide out-of-the-box metrics. This means you can use any prompt template to calculate metrics, but it also means that the metrics are research-backed, and don't give you granular score control.

DeepEval

Langfuse

RAG metrics

The popular RAG metrics such as faithfulness

Conversational metrics

Evaluates LLM chatbot conversationals

Agentic metrics

Evaluates agentic workflows, tool use

Red teaming metrics

Metrics for LLM safety and security like bias, PII leakage

Multi-modal metrics

Metrics involving image generations as well

Use case specific metrics

Summarization, JSON correctness, etc.

Custom, research-backed metrics

Custom metrics builder should have research-backing

Custom, deterministic metrics

Custom, LLM powered decision-based metrics

Fully customizable metrics

Use existing metric templates for full customization

Limited

Explanability

Metric provides reasons for all runs

Run using any LLM judge

Not vendor-locked into any framework for LLM providers

JSON-confineable

Custom LLM judges can be forced to output valid JSON for metrics

Limited

Verbose debugging

Debug LLM thinking processes during evaluation

Caching

Optionally save metric scores to avoid re-computation

Cost tracking

Track LLM judge token usage cost for each metric run

Integrates with Confident AI

Custom metrics or not, whether it can be on the cloud

Dataset Generation

Langfuse offers a dataset management UI, but doesn't have dataset generation capabilities.

DeepEval

Langfuse

Generate from documents

Synthesize goldens that are grounded in documents

Generate from ground truth

Synthesize goldens that are grounded in context

Generate free form goldens

Synthesize goldens that are not grounded

Quality filtering

Remove goldens that do not meet the quality standards

Non vendor-lockin

No Langchain, LlamaIndex, etc. required

Customize language

Generate in français, español, deutsch, italiano, 日本語, etc.

Customize output format

Generate SQL, code, etc. not just simple QA

Supports any LLMs

Generate using any LLMs, with JSON confinement

Save generations to Confident AI

Not just generate, but bring it to your organization

Red teaming

We created DeepTeam, our second open-source package, to make LLM red-teaming seamless (without the need to switch tool ecosystems) and scalable—when the need for LLM safety and security testing arises.

Langfuse doesn't offer red-teaming.

DeepEval

Langfuse

Predefined vulnerabilities

Vulnerabilities such as bias, toxicity, misinformation, etc.

Attack simulation

Simulate adversarial attacks to expose vulnerabilities

Single-turn attack methods

Prompt injection, ROT-13, leetspeak, etc.

Multi-turn attack methods

Linear jailbreaking, tree jailbreaking, etc.

Data privacy metrics

PII leakage, prompt leakage, etc.

Responsible AI metrics

Bias, toxicity, fairness, etc.

Unauthorized access metrics

RBAC, SSRF, shell injection, sql injection, etc.

Brand image metrics

Misinformation, IP infringement, robustness, etc.

Illegal risks metrics

Illegal activity, graphic content, personal safety, etc.

OWASP Top 10 for LLMs

Follows industry guidelines and standards

Using DeepTeam for LLM red-teaming means you get the same experience from using DeepEval for evaluations, but with LLM safety and security testing.

Checkout DeepTeam's documentation for more detail.

Benchmarks

DeepEval is the first framework to make LLM benchmarking easy and accessible. Previously, benchmarking meant digging through scattered repos, wrangling compute, and managing complex setups. With DeepEval, you can configure your model once and run all your benchmarks in under 10 lines of code.

Langfuse doesn't offer LLM benchmarking.

DeepEval

Langfuse

MMLU

Vulnerabilities such as bias, toxicity, misinformation, etc.

HellaSwag

Vulnerabilities such as bias, toxicity, misinformation, etc.

Big-Bench Hard

Vulnerabilities such as bias, toxicity, misinformation, etc.

DROP

Vulnerabilities such as bias, toxicity, misinformation, etc.

TruthfulQA

Vulnerabilities such as bias, toxicity, misinformation, etc.

HellaSwag

Vulnerabilities such as bias, toxicity, misinformation, etc.

This is not the entire list (DeepEval has 15 benchmarks and counting).

Integrations

Both tools offer a variety of integrations. Langfuse mainly integrates with LLM frameworks like LangChain and LlamaIndex for tracing, while DeepEval also supports evaluation integrations on top of observability.

DeepEval

Langfuse

Pytest

First-class integration with Pytest for testing in CI/CD

LangChain & LangGraph

Run evals within the Lang ecosystem, or apps built with it

LlamaIndex

Run evals within the LlamaIndex ecosystem, or apps built with it

Hugging Face

Run evals during fine-tuning/training of models

ChromaDB

Run evals on RAG pipelines built on Chroma

Weaviate

Run evals on RAG pipelines built on Weaviate

Elastic

Run evals on RAG pipelines built on Elastic

QDrant

Run evals on RAG pipelines built on Qdrant

PGVector

Run evals on RAG pipelines built on PGVector

Langsmith

Can be used within the Langsmith platform

Helicone

Can be used within the Helicone platform

Confident AI

Integrated with Confident AI

DeepEval also integrates directly with LLM providers to power its metrics, from closed-source providers like OpenAI and Azure to open-source providers like Ollama, vLLM, and more.

Platform

Both DeepEval and Langfuse has their own platforms. DeepEval's platform is called Confident AI, and Langfuse's platform is also called Langfuse. Confident AI is built for powerful, customizable evaluation and benchmarking. Langfuse, on the other hand, is more focused on observability.

DeepEval

Langfuse

Metric annotation

Annotate the correctness of each metric

Sharable testing reports

Comprehensive reports that can be shared with stakeholders

A|B regression testing

Determine any breaking changes before deployment

Prompts and models experimentation

Figure out which prompts and models work best

Limited

Dataset editor

Domain experts can edit datasets on the cloud

Dataset revision history & backups

Point in time recovery, edit history, etc.

Limited

Metric score analysis

Score distributions, mean, median, standard deviation, etc.

Metric validation

False positives, false negatives, confusion matrices, etc.

Prompt versioning

Edit and manage prompts on the cloud instead of CSV

Metrics on the cloud

Run metrics on the platform instead of locally

Trigger evals via HTTPs

For users that are using (java/type)script

Trigger evals without code

For stakeholders that are non-technical

Alerts and notifications

Pings your slack, teams, discord, after each evaluation run.

LLM observability & tracing

Monitor LLM interactions in production

Online metrics in production

Continuously monitor LLM performance

Human feedback collection

Collect feedback from internal team members or end users

LLM guardrails

Ultra-low latency guardrails in production

LLM red teaming

Managed LLM safety testing and attack curation

Self-hosting

On-prem deployment so nothing leaves your data center

SSO

Authenticate with your Idp of choice

User roles & permissions

Custom roles, permissions, data segregation for different teams

Transparent pricing

Pricing should be available on the website

HIPAA-ready

For companies in the healthcare industry

SOCII certification

For companies that need additional security compliance

Confident AI is also self-served, meaning you don't have to talk to us to try it out. Sign up here.

Conclusion

If there’s one takeaway: Langfuse is built for debugging, Confident AI is built for evaluation. They overlap in places, but the difference comes down to focus — observability vs. benchmarking. If you care about both, go with Confident AI, since it gives you far more depth and flexibility when it comes to evaluation.

DeepEval vs Ragas

March 19, 2025 · 8 min read

Jeffrey Ip

DeepEval Wizard

TL;DR: Ragas is well-suited for lightweight experimentation — much like using pandas for quick data analysis. DeepEval takes a broader approach, offering a full evaluation ecosystem designed for production workflows, CI/CD integration, custom metrics, and integration with Confident AI for team collaboration, reporting, and analysis. The right tool depends on whether you're running ad hoc evaluations or building scalable LLM testing into your LLM stack.

How is DeepEval Different?

1. We're built for developers

DeepEval was created by founders with a mixture of engineering backgrounds from Google and AI research backgrounds from Princeton. What you'll find is DeepEval is much more suited for an engineering workflow, while providing the necessary research in its metrics.

This means:

Unit-testing in CI/CD pipelines with DeepEval's first-class pytest integration
Modular, plug-and-play metrics that you can use to build your own evaluation pipeline
Less bugs and clearer error messages, so you know exactly what is going on
Extensive customizations with no vendor-locking into any LLM or framework
Abstracted into clear, extendable classes and methods for better reusability
Clean, readable code that is essential if you ever need to customize DeepEval for yourself
Exhaustive ecosystem, meaning you can easily build on top of DeepEval while taking advantage of DeepEval's features

2. We care about your experience, a lot

We care about the usability of DeepEval and wake up everyday thinking about how we can make either the codebase or documentation better to help our users do LLM evaluation better. In fact, everytime someone asks a question in DeepEval's discord, we always try to respond with not just an answer but a relevant link to the documentation that they can read more on. If there is no such relevant link that we can provide users, that means our documentation needs improving.

In terms of the codebase, a recent example is we actually broke away DeepEval's red teaming (safety testing) features into a whole now package, called DeepTeam, which took around a month of work, just so users that primarily need LLM red teaming can work in that repo instead.

3. We have a vibrant community

Whenever we're working, the team is always in the discord community on a voice call. Although we might not be talking all the time (in fact most times on mute), we do this to let users know we're always here whenever they run into a problem.

This means you'll find people are more willing to ask questions with active discussions going on.

4. We ship extremely fast

We always aim to resolve issues in DeepEval's discord in < 3 days. Sometimes, especially if there's too much going on in the company, it takes another week longer, and if you raise an issue on GitHub issues instead, we might miss it, but other than that, we're pretty consistent.

We also take a huge amount of effort to ship the latest features required for the best LLM evaluation in an extremely short amount of time (it took under a week for the entire DAG metric to be built, tested, with documentation written). When we see something that could clearly help our users, we get it done.

5. We offer more features, with less bugs

Our heavy engineering backgrounds allow us to ship more features with less bugs in them. Given that we aim to handle all errors that happen within DeepEval gracefully, your experience when using DeepEval will be a lot better.

There's going to be a few comparison tables in later sections to talk more about the additional features you're going to get with DeepEval.

6. We scale with your evaluation needs

This means 0 extra lines of code to:

Analyze metric score distributions, averages, and median scores
Generate testing reports for you to inspect and debug test cases
Download and save testing results as CSV/JSON
Share testing reports within your organization and external stakeholders
Regression testing to determine whether your LLM app is OK to deploy
Experimentation with different models and prompts side-by-side
Keep datasets centralized on the cloud

Comparing DeepEval and Ragas

If DeepEval is so good, why is Ragas so popular? Ragas started off as a research paper that focused on the reference-less evaluation of RAG pipelines in early 2023 and got mentioned by OpenAI during their dev day in November 2023.

But the very research nature of Ragas means that you're not going to get as good a developer experience compared to DeepEval. In fact, we had to re-implement all of Ragas's metrics into our own RAG metrics back in early 2024 because they didn't offer things such as:

Explanability (reasoning for metric scores)
Verbose debugging (the thinking process of LLM judges used for evaluation)
Using any custom LLM-as-a-judge (as required by many organizations)
Evaluation cost tracking

And our users simply couldn't wait for Ragas to ship it before being able to use it in DeepEval's ecosystem (that's why you see that we have our own RAG metrics, and the RAGASMetric, which just wraps around Ragas' metrics but with less functionality).

For those that argues that Ragas is more trusted because they have a research-paper, that was back in 2023 and the metrics has changed a lot since then.

Metrics

DeepEval and Ragas both specialize in RAG evaluation, however:

Ragas's metrics has limited support for explanability, verbose log debugging, and error handling, and customizations
DeepEval's metrics go beyond RAG, with support for agentic workflows, LLM chatbot conversations, all through its plug-and-play metrics.

DeepEval also integrates with Confident AI so you can bring these metrics to your organization whenever you're ready.

DeepEval

Ragas

RAG metrics

The popular RAG metrics such as faithfulness

Conversational metrics

Evaluates LLM chatbot conversationals

Agentic metrics

Evaluates agentic workflows, tool use

Red teaming metrics

Metrics for LLM safety and security like bias, PII leakage

Multi-modal metrics

Metrics involving image generations as well

Use case specific metrics

Summarization, JSON correctness, etc.

Custom, research-backed metrics

Custom metrics builder should have research-backing

Custom, deterministic metrics

Custom, LLM powered decision-based metrics

Fully customizable metrics

Use existing metric templates for full customization

Explanability

Metric provides reasons for all runs

Run using any LLM judge

Not vendor-locked into any framework for LLM providers

JSON-confineable

Custom LLM judges can be forced to output valid JSON for metrics

Verbose debugging

Debug LLM thinking processes during evaluation

Caching

Optionally save metric scores to avoid re-computation

Cost tracking

Track LLM judge token usage cost for each metric run

Integrates with Confident AI

Custom metrics or not, whether it can be on the cloud

Dataset Generation

DeepEval and Ragas both offers in dataset generation, and while Ragas is deeply locked into the Langchain and LlamaIndex ecosystem, meaning you can't easily generate from any documents, and offers limited customizations, DeepEval's synthesizer is 100% customizable within a few lines of code

If you look at the table below, you'll see that DeepEval's synthesizer is very flexible.

DeepEval

Ragas

Generate from documents

Synthesize goldens that are grounded in documents

Generate from ground truth

Synthesize goldens that are grounded in context

Generate free form goldens

Synthesize goldens that are not grounded

Quality filtering

Remove goldens that do not meet the quality standards

Non vendor-lockin

No Langchain, LlamaIndex, etc. required

Customize language

Generate in français, español, deutsch, italiano, 日本語, etc.

Customize output format

Generate SQL, code, etc. not just simple QA

Supports any LLMs

Generate using any LLMs, with JSON confinement

Save generations to Confident AI

Not just generate, but bring it to your organization

Red teaming

We even built a second open-source package dedicated for red teaming within DeepEval's ecosystem, just so you don't have to worry about switching frameworks as you scale to safety testing.

Ragas offers no red teaming at all.

DeepEval

Ragas

Predefined vulnerabilities

Vulnerabilities such as bias, toxicity, misinformation, etc.

Attack simulation

Simulate adversarial attacks to expose vulnerabilities

Single-turn attack methods

Prompt injection, ROT-13, leetspeak, etc.

Multi-turn attack methods

Linear jailbreaking, tree jailbreaking, etc.

Data privacy metrics

PII leakage, prompt leakage, etc.

Responsible AI metrics

Bias, toxicity, fairness, etc.

Unauthorized access metrics

RBAC, SSRF, shell injection, sql injection, etc.

Brand image metrics

Misinformation, IP infringement, robustness, etc.

Illegal risks metrics

Illegal activity, graphic content, personal safety, etc.

OWASP Top 10 for LLMs

Follows industry guidelines and standards

We want users to stay in DeepEval's ecosystem even for LLM red teaming, because this allows us to provide you the same experience you get from DeepEval, even for LLM safety and security testing.

Checkout DeepTeam's documentation, which powers DeepEval's red teaming capabilities, for more detail.

Benchmarks

This was more of a fun project, but when we noticed LLM benchmarks were so get hold of we decided to make DeepEval the first framework to make LLM benchmarks so widely accessible. In the past, benchmarking foundational models were compute-heavy and messy. Now with DeepEval, 10 lines of code is all that is needed.

DeepEval

Ragas

MMLU

Vulnerabilities such as bias, toxicity, misinformation, etc.

HellaSwag

Vulnerabilities such as bias, toxicity, misinformation, etc.

Big-Bench Hard

Vulnerabilities such as bias, toxicity, misinformation, etc.

DROP

Vulnerabilities such as bias, toxicity, misinformation, etc.

TruthfulQA

Vulnerabilities such as bias, toxicity, misinformation, etc.

HellaSwag

Vulnerabilities such as bias, toxicity, misinformation, etc.

This is not the entire list (DeepEval has 15 benchmarks and counting), and Ragas offers no benchmarks at all.

Integrations

Both offer integrations, but with a different focus. Ragas' integrations pushes users onto other platforms such as Langsmith and Helicone, while DeepEval is more focused on providing users the means to evaluate their LLM applications no matter whatever stack they are currently using.

DeepEval

Ragas

Pytest

First-class integration with Pytest for testing in CI/CD

LangChain & LangGraph

Run evals within the Lang ecosystem, or apps built with it

LlamaIndex

Run evals within the LlamaIndex ecosystem, or apps built with it

Hugging Face

Run evals during fine-tuning/training of models

ChromaDB

Run evals on RAG pipelines built on Chroma

Weaviate

Run evals on RAG pipelines built on Weaviate

Elastic

Run evals on RAG pipelines built on Elastic

QDrant

Run evals on RAG pipelines built on Qdrant

PGVector

Run evals on RAG pipelines built on PGVector

Langsmith

Can be used within the Langsmith platform

Helicone

Can be used within the Helicone platform

Confident AI

Integrated with Confident AI

You'll notice that Ragas does not own their platform integrations such as LangSmith, while DeepEval owns Confident AI. This means bringing LLM evaluation to your organization is 10x easier using DeepEval.

Platform

Both DeepEval and Ragas has their own platforms. DeepEval's platform is called Confident AI, and Ragas's platform is also called Ragas.

Both have varying degrees of capabilities, and you can draw your own conclusions from the table below.

DeepEval

Ragas

Metric annotation

Annotate the correctness of each metric

Sharable testing reports

Comprehensive reports that can be shared with stakeholders

A|B regression testing

Determine any breaking changes before deployment

Prompts and models experimentation

Figure out which prompts and models work best

Dataset editor

Domain experts can edit datasets on the cloud

Dataset revision history & backups

Point in time recovery, edit history, etc.

Metric score analysis

Score distributions, mean, median, standard deviation, etc.

Metric validation

False positives, false negatives, confusion matrices, etc.

Prompt versioning

Edit and manage prompts on the cloud instead of CSV

Metrics on the cloud

Run metrics on the platform instead of locally

Trigger evals via HTTPs

For users that are using (java/type)script

Trigger evals without code

For stakeholders that are non-technical

Alerts and notifications

Pings your slack, teams, discord, after each evaluation run.

LLM observability & tracing

Monitor LLM interactions in production

Online metrics in production

Continuously monitor LLM performance

Human feedback collection

Collect feedback from internal team members or end users

LLM guardrails

Ultra-low latency guardrails in production

LLM red teaming

Managed LLM safety testing and attack curation

Self-hosting

On-prem deployment so nothing leaves your data center

SSO

Authenticate with your Idp of choice

User roles & permissions

Custom roles, permissions, data segregation for different teams

Transparent pricing

Pricing should be available on the website

HIPAA-ready

For companies in the healthcare industry

SOCII certification

For companies that need additional security compliance

Confident AI is also self-served, meaning you don't have to talk to us to try it out. Sign up here.

Conclusion

If there's one thing to remember, we care about your LLM evaluation experience more than anyone else, and apart from anything else this should be more than enough to get started with DeepEval.

DeepEval vs Trulens

March 19, 2025 · 4 min read

Jeffrey Ip

DeepEval Wizard

TL;DR: TruLens offers useful tooling for basic LLM app monitoring and runtime feedback, but it’s still early-stage and lacks many core evaluation features — including agentic and conversational metrics, granular test control, and safety testing. DeepEval takes a more complete approach to LLM evaluation, supporting structured testing, CI/CD workflows, custom metrics, and integration with Confident AI for collaborative analysis, sharing, and decision-making across teams.

What Makes DeepEval Stand Out?

1. Purpose-Built for Developers

DeepEval is designed by engineers with roots at Google and AI researchers from Princeton — so naturally, it's built to slot right into an engineering workflow without sacrificing metric rigor.

Key developer-focused advantages include:

Seamless CI/CD integration via native pytest support
Composable metric modules for flexible pipeline design
Cleaner error messaging and fewer bugs
No vendor lock-in — works across LLMs and frameworks
Extendable abstractions built with reusable class structures
Readable, modifiable code that scales with your needs
Ecosystem ready — DeepEval is built to be built on

2. We Obsess Over Developer Experience

From docs to DX, we sweat the details. Whether it's refining error handling or breaking off red teaming into a separate package (deepteam), we're constantly iterating based on what you need.

Every Discord question is an opportunity to improve the product. If the docs don’t have an answer, that’s our cue to fix it.

3. The Community is Active (and Always On)

We're always around — literally. The team hangs out in the DeepEval Discord voice chat while working (yes, even if muted). It makes us accessible, and users feel more comfortable jumping in and asking for help. It’s part of our culture.

4. Fast Releases, Fast Fixes

Most issues reported in Discord are resolved in under 3 days. If it takes longer, we communicate — and we prioritize.

When something clearly helps our users, we move fast. For instance, we shipped the full DAG metric — code, tests, and docs — in under a week.

5. More Features, Fewer Bugs

Because our foundation is engineering-first, you get a broader feature set with fewer issues. We aim for graceful error handling and smooth dev experience, so you're not left guessing when something goes wrong.

Comparison tables below will show what you get with DeepEval out of the box.

6. Scales with Your Org

DeepEval works out of the box for teams — no extra setup needed. It integrates automatically with Confident AI, our dashboard for visualizing and sharing LLM evaluation results.

Without writing any additional code, you can:

Visualize score distributions and trends
Generate and share test reports internally or externally
Export results to CSV or JSON
Run regression tests for safe deployment
Compare prompts, models, or changes side-by-side
Manage and reuse centralized datasets

For safety-focused teams, DeepTeam (our red teaming toolkit) plugs right in. DeepEval is an ecosystem — not a dead end.

Comparing DeepEval and Trulens

If you're reading this, there's a good chance you're in academia. Trulens was founded by Stanford professors and got really popular back in late 2023 and early 2024 through a DeepLearning course with Andrew Ng. However the traction slowly died after this initial boost, especially after the Snowflake acquisition.

And so, you'll find DeepEval provides a lot more well-rounded features and support for all different use cases (RAG, agentic, conversations), and completes all parts of the evaluation workflow (dataset generation, benchmarking, platform integration, etc.).

Metrics

DeepEval does RAG evaluation very well, but it doesn't end there.

DeepEval

Trulens

RAG metrics

The popular RAG metrics such as faithfulness

Conversational metrics

Evaluates LLM chatbot conversationals

Agentic metrics

Evaluates agentic workflows, tool use

Red teaming metrics

Metrics for LLM safety and security like bias, PII leakage

Multi-modal metrics

Metrics involving image generations as well

Use case specific metrics

Summarization, JSON correctness, etc.

Custom, research-backed metrics

Custom metrics builder should have research-backing

Custom, deterministic metrics

Custom, LLM powered decision-based metrics

Fully customizable metrics

Use existing metric templates for full customization

Explanability

Metric provides reasons for all runs

Run using any LLM judge

Not vendor-locked into any framework for LLM providers

JSON-confineable

Custom LLM judges can be forced to output valid JSON for metrics

Verbose debugging

Debug LLM thinking processes during evaluation

Caching

Optionally save metric scores to avoid re-computation

Cost tracking

Track LLM judge token usage cost for each metric run

Integrates with Confident AI

Custom metrics or not, whether it can be on the cloud

Dataset Generation

DeepEval offers a comprehensive synthetic data generator while Trulens does not have any generation capabilities.

DeepEval

Trulens

Generate from documents

Synthesize goldens that are grounded in documents

Generate from ground truth

Synthesize goldens that are grounded in context

Generate free form goldens

Synthesize goldens that are not grounded

Quality filtering

Remove goldens that do not meet the quality standards

Non vendor-lockin

No Langchain, LlamaIndex, etc. required

Customize language

Generate in français, español, deutsch, italiano, 日本語, etc.

Customize output format

Generate SQL, code, etc. not just simple QA

Supports any LLMs

Generate using any LLMs, with JSON confinement

Save generations to Confident AI

Not just generate, but bring it to your organization

Red teaming

Trulens offers no red teaming at all, so only DeepEval will help you as you scale to safety and security LLM testing.

DeepEval

Trulens

Predefined vulnerabilities

Vulnerabilities such as bias, toxicity, misinformation, etc.

Attack simulation

Simulate adversarial attacks to expose vulnerabilities

Single-turn attack methods

Prompt injection, ROT-13, leetspeak, etc.

Multi-turn attack methods

Linear jailbreaking, tree jailbreaking, etc.

Data privacy metrics

PII leakage, prompt leakage, etc.

Responsible AI metrics

Bias, toxicity, fairness, etc.

Unauthorized access metrics

RBAC, SSRF, shell injection, sql injection, etc.

Brand image metrics

Misinformation, IP infringement, robustness, etc.

Illegal risks metrics

Illegal activity, graphic content, personal safety, etc.

OWASP Top 10 for LLMs

Follows industry guidelines and standards

Checkout DeepTeam's documentation, which powers DeepEval's red teaming capabilities, for more detail.

Benchmarks

In the past, benchmarking foundational models were compute-heavy and messy. Now with DeepEval, 10 lines of code is all that is needed.

DeepEval

Trulens

MMLU

Vulnerabilities such as bias, toxicity, misinformation, etc.

HellaSwag

Vulnerabilities such as bias, toxicity, misinformation, etc.

Big-Bench Hard

Vulnerabilities such as bias, toxicity, misinformation, etc.

DROP

Vulnerabilities such as bias, toxicity, misinformation, etc.

TruthfulQA

Vulnerabilities such as bias, toxicity, misinformation, etc.

HellaSwag

Vulnerabilities such as bias, toxicity, misinformation, etc.

This is not the entire list (DeepEval has 15 benchmarks and counting), and Trulens offers no benchmarks at all.

Integrations

DeepEval offers countless integrations with the tools you are likely already building with.

DeepEval

Trulens

Pytest

First-class integration with Pytest for testing in CI/CD

LangChain & LangGraph

Run evals within the Lang ecosystem, or apps built with it

LlamaIndex

Run evals within the LlamaIndex ecosystem, or apps built with it

Hugging Face

Run evals during fine-tuning/training of models

ChromaDB

Run evals on RAG pipelines built on Chroma

Weaviate

Run evals on RAG pipelines built on Weaviate

Elastic

Run evals on RAG pipelines built on Elastic

QDrant

Run evals on RAG pipelines built on Qdrant

PGVector

Run evals on RAG pipelines built on PGVector

Snowflake

Integrated with Snowflake logs

Confident AI

Integrated with Confident AI

Platform

DeepEval's platform is called Confident AI, and Trulen's platform is hidden and minimal.

DeepEval

Trulens

Sharable testing reports

Comprehensive reports that can be shared with stakeholders

A|B regression testing

Determine any breaking changes before deployment

Prompts and models experimentation

Figure out which prompts and models work best

Dataset editor

Domain experts can edit datasets on the cloud

Dataset revision history & backups

Point in time recovery, edit history, etc.

Metric score analysis

Score distributions, mean, median, standard deviation, etc.

Metric annotation

Annotate the correctness of each metric

Metric validation

False positives, false negatives, confusion matrices, etc.

Prompt versioning

Edit and manage prompts on the cloud instead of CSV

Metrics on the cloud

Run metrics on the platform instead of locally

Trigger evals via HTTPs

For users that are using (java/type)script

Trigger evals without code

For stakeholders that are non-technical

Alerts and notifications

Pings your slack, teams, discord, after each evaluation run.

LLM observability & tracing

Monitor LLM interactions in production

Online metrics in production

Continuously monitor LLM performance

Human feedback collection

Collect feedback from internal team members or end users

LLM guardrails

Ultra-low latency guardrails in production

LLM red teaming

Managed LLM safety testing and attack curation

Self-hosting

On-prem deployment so nothing leaves your data center

SSO

Authenticate with your Idp of choice

User roles & permissions

Custom roles, permissions, data segregation for different teams

Transparent pricing

Pricing should be available on the website

HIPAA-ready

For companies in the healthcare industry

SOCII certification

For companies that need additional security compliance

Confident AI is also self-served, meaning you don't have to talk to us to try it out. Sign up here.

Conclusion

DeepEval offers much more features and better community, and should be more than enough to support all your LLM evaluation needs. Get started with DeepEval here.

TL;DR​

The Unique Challenges​

Building the Chatbot​

Evaluating Your Chatbot with DeepEval​

Simulating conversations​

Evaluating the chatbot​

Improving Your Chatbot with DeepEval​

Unit Testing in CI/CD for Continuous Evaluation​

Conclusion​

Evaluating Your Retriever with DeepEval​

Building a basic retriever​

Generating Goldens​

Improving your retriever​

Evaluating Your Generator with DeepEval​

Building a basic generator​

Improving your generator​

CI/CD Integration for Continuous Evaluation​

Why generate golden data in CI?​

Integrating DeepEval tests into your CI/CD​

Conclusion​

The Challenge That Cognee Faced​

Cognee's Comprehensive Approach Using DeepEval​

How Cognee Used DeepEval's Correctness Metric​

Leveraging DeepEval's Contextual Relevancy Metric​

DeepEval's Context Coverage in Action​

What Cognee Learned About AI Memory Evaluation​

Understanding Evaluation Stability Challenges​

Discovering Evaluation Bias Patterns​

Real-World Technical Implementation Lessons​

Broader Implications for AI Memory Research​

Demonstrating the Power of Standardized Evaluation​

Showcasing Multi-Dimensional Evaluation Benefits​

How This Case Study Impacts AI Memory Development​

Future Directions Inspired by This Case Study​

Lessons for Other DeepEval Users​

What's Next for DeepEval and AI Memory Evaluation​

Explore Cognee's Research​

What is G-Eval?​

Why DeepEval for G-Eval?​

Answer Correctness​

Best practices​

Coherence​

Criteria​

Best practices​

Tonality​

Criteria​

Best practices​

Safety​

Criteria​

Best practices​

Custom RAG Metrics​

Criteria​

Best practices​

Advanced Usage​

Using G-Eval in DAG​

Example​

Conclusion​

Ragas​

Ragas vs Deepeval Summary​

Key differences​

What people like about Ragas​

What people dislike about Ragas​

Arize AI Phoenix​

Arize vs Deepeval Summary​

Key differences​

What people like about Arize​

What people dislike about Arize​

Promptfoo​

Promptfoo vs Deepeval Summary​

Key differences​

What people like about Promptfoo​

What people dislike about Promptfoo​

Langfuse​

Langfuse vs Deepeval Summary​

Key differences​

What people like about Langfuse​

What people dislike about Langfuse​

Braintrust​

Braintrust vs Deepeval Summary​

Key differences​

TL;DR

The Unique Challenges

Building the Chatbot

Evaluating Your Chatbot with DeepEval

Simulating conversations

Evaluating the chatbot

Improving Your Chatbot with DeepEval

Unit Testing in CI/CD for Continuous Evaluation

Conclusion

Evaluating Your Retriever with DeepEval

Building a basic retriever

Generating Goldens

Improving your retriever

Evaluating Your Generator with DeepEval

Building a basic generator

Improving your generator

CI/CD Integration for Continuous Evaluation

Why generate golden data in CI?

Integrating DeepEval tests into your CI/CD

Conclusion

The Challenge That Cognee Faced

Cognee's Comprehensive Approach Using DeepEval

How Cognee Used DeepEval's Correctness Metric

Leveraging DeepEval's Contextual Relevancy Metric

DeepEval's Context Coverage in Action

What Cognee Learned About AI Memory Evaluation

Understanding Evaluation Stability Challenges

Discovering Evaluation Bias Patterns

Real-World Technical Implementation Lessons

Broader Implications for AI Memory Research

Demonstrating the Power of Standardized Evaluation

Showcasing Multi-Dimensional Evaluation Benefits

How This Case Study Impacts AI Memory Development

Future Directions Inspired by This Case Study

Lessons for Other DeepEval Users

What's Next for DeepEval and AI Memory Evaluation

Explore Cognee's Research

What is G-Eval?

Why DeepEval for G-Eval?

Answer Correctness

Best practices

Coherence

Criteria

Best practices

Tonality

Criteria

Best practices

Safety

Criteria

Best practices

Custom RAG Metrics

Criteria

Best practices

Advanced Usage

Using G-Eval in DAG

Example

Conclusion

Ragas

Ragas vs Deepeval Summary

Key differences

What people like about Ragas

What people dislike about Ragas

Arize AI Phoenix

Arize vs Deepeval Summary

Key differences

What people like about Arize

What people dislike about Arize

Promptfoo

Promptfoo vs Deepeval Summary

Key differences

What people like about Promptfoo

What people dislike about Promptfoo

Langfuse

Langfuse vs Deepeval Summary

Key differences

What people like about Langfuse

What people dislike about Langfuse

Braintrust

Braintrust vs Deepeval Summary

Key differences