
Metrics Introduction

deepeval offers 40+ SOTA, ready-to-use metrics for you to quickly get started with. Essentially, while a metric acts as the ruler for a specific criterion of interest, a test case represents the thing you're trying to measure.

Quick Summary

Almost all predefined metrics in deepeval use LLM-as-a-judge, with various techniques such as QAG (question-answer generation), DAG (deep acyclic graphs), and G-Eval to score test cases, which represent atomic interactions with your LLM app.

All of deepeval's metrics output a score between 0-1 based on their corresponding equations, along with a reason for the score. A metric is only successful if the evaluation score is equal to or greater than its threshold, which defaults to 0.5 for all metrics.
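This pass/fail rule can be sketched in plain Python (an illustrative sketch of the default behavior for most metrics, not deepeval's actual implementation):

```python
def is_successful(score: float, threshold: float = 0.5) -> bool:
    # A metric passes only when its 0-1 score meets or exceeds the threshold
    return score >= threshold

print(is_successful(0.7))  # True: 0.7 >= 0.5
print(is_successful(0.4))  # False: 0.4 < 0.5
```

A few metrics invert this and pass when the score falls below the threshold; check each metric's docs.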

Custom metrics allow you to define your own criteria using SOTA implementations of LLM-as-a-judge metrics in everyday language:

  • G-Eval
  • DAG (Deep Acyclic Graph)
  • Conversational G-Eval
  • Multi-modal G-Eval
  • Arena G-Eval
  • Do it yourself, 100% self-coded metrics (e.g. if you want to use BLEU, ROUGE)

You should aim to have at least one custom metric in your LLM evals pipeline.

info

Most metrics only require 1-2 parameters in a test case, so it's important that you visit each metric's documentation pages to learn what's required.

Your LLM app can be evaluated end-to-end (component-level example further below) by providing a list of metrics and test cases:

main.py
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate

evaluate(
    metrics=[AnswerRelevancyMetric()],
    test_cases=[LLMTestCase(input="What's DeepEval?", actual_output="Your favorite eval framework's favorite evals framework.")]
)

If you're logged into Confident AI before running an evaluation (deepeval login or deepeval view in the CLI), you'll also get full testing reports on the platform.

More information on everything can be found on the Confident AI evaluation docs.

Why DeepEval Metrics?

Apart from the variety of metrics offered, deepeval's metrics are a step up from other implementations because they:

  • Are research-backed LLM-as-a-judge implementations (e.g. G-Eval)
  • Make deterministic metric scores possible (when using DAGMetric)
  • Are extra reliable, as LLMs are only used for extremely confined tasks during evaluation, greatly reducing stochasticity and flakiness in scores
  • Provide a comprehensive reason for the scores computed
  • Integrate 100% with Confident AI

Create Your First Metric

Custom Metrics

deepeval provides G-Eval, a state-of-the-art LLM evaluation framework for anyone to create a custom LLM-evaluated metric using natural language. G-Eval is available for all single-turn, multi-turn, and multimodal evals.

from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

test_case = LLMTestCase(input="...", actual_output="...", expected_output="...")
correctness = GEval(
    name="Correctness",
    criteria="Correctness - determine if the actual output is correct according to the expected output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    strict_mode=True
)

correctness.measure(test_case)
print(correctness.score, correctness.reason)

Under the hood, deepeval first generates a series of evaluation steps, before using these steps in conjunction with information in an LLMTestCase for evaluation. For more information, visit the G-Eval documentation page.

tip

If you're looking for a decision-tree based LLM-as-a-judge, check out the Deep Acyclic Graph (DAG) metric.

Default Metrics

The most used RAG metrics include:

  • Answer Relevancy: Evaluates if the generated answer is relevant to the user query
  • Faithfulness: Measures if the generated answer is factually consistent with the provided context
  • Contextual Relevancy: Assesses if the retrieved context is relevant to the user query
  • Contextual Recall: Evaluates if the retrieved context contains all relevant information
  • Contextual Precision: Measures if the retrieved context is precise and focused

These can be imported directly from the deepeval.metrics module:

main.py
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

test_case = LLMTestCase(input="...", actual_output="...")
relevancy = AnswerRelevancyMetric(threshold=0.5)

relevancy.measure(test_case)
print(relevancy.score, relevancy.reason)

Choosing Your Metrics

These are the metric categories to consider when choosing your metrics:

  • Custom metrics are use case specific and architecture agnostic:
    • G-Eval – best for subjective criteria like correctness, coherence, or tone; easy to set up.
    • DAG – decision-tree metric for objective or mixed criteria (e.g., verify format before tone).
    • Start with G-Eval for simplicity; use DAG for more control. You can also subclass BaseMetric to create your own.
  • Generic metrics are system specific and use case agnostic:
    • RAG metrics: measure the retriever and generator separately
    • Agent metrics: evaluate tool usage and task completion
    • Multi-turn metrics: measure overall dialogue quality
    • Combine these for multi-component LLM systems.
  • Reference vs. Referenceless:
    • Reference-based metrics need ground truth (e.g., contextual recall or tool correctness).
    • Referenceless metrics work without labeled data, ideal for online or production evaluation.
    • Check each metric’s docs for required parameters.
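To make the reference distinction concrete, here's a plain-Python sketch of the data each category needs (field names mirror deepeval's LLMTestCase parameters; the values are made up):

```python
# Referenceless: only the live interaction is needed, so it works in production
referenceless_case = {
    "input": "What's DeepEval?",
    "actual_output": "An open-source LLM evaluation framework.",
}

# Reference-based: additionally needs labeled ground truth, e.g. an
# expected_output for contextual recall or correctness-style metrics
reference_based_case = {
    **referenceless_case,
    "expected_output": "DeepEval is an open-source LLM evaluation framework.",
}
```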

info

If you're running metrics in production, you must choose a referenceless metric since no labelled data will exist.

When deciding on metrics, no matter how tempting, try to limit yourself to no more than 5 metrics, with this breakdown:

  • 2-3 generic, system-specific metrics (e.g. contextual precision for RAG, tool correctness for agents)
  • 1-2 custom, use case-specific metrics (e.g. helpfulness for a medical chatbot, format correctness for summarization)

The goal is to force yourself to prioritize and clearly define your evaluation criteria. This will not only help you use deepeval, but also help you understand what you care most about in your LLM application.

Here are some additional ideas if you're not sure:

  • RAG: Focus on the AnswerRelevancyMetric (evaluates actual_output alignment with the input) and FaithfulnessMetric (checks for hallucinations against retrieval_context)
  • Agents: Use the ToolCorrectnessMetric to verify proper tool selection and usage
  • Chatbots: Implement a ConversationCompletenessMetric to assess overall conversation quality
  • Custom Requirements: When standard metrics don't fit your needs, create custom evaluations with G-Eval or DAG frameworks

In some cases where your LLM is doing most of the heavy lifting, it is not uncommon to have more use case specific metrics.

Configure LLM Judges

You can use ANY LLM judge in deepeval, including OpenAI, Azure OpenAI, Ollama, Anthropic, Gemini, LiteLLM, etc. You can also wrap your own LLM API in deepeval's DeepEvalBaseLLM class to use ANY model of your choice. Click here for the full guide.

To use OpenAI for deepeval's LLM metrics, supply your OPENAI_API_KEY in the CLI:

export OPENAI_API_KEY=<your-openai-api-key>

Alternatively, if you're working in a notebook environment (Jupyter or Colab), set your OPENAI_API_KEY in a cell:

%env OPENAI_API_KEY=<your-openai-api-key>

note

Please do not include quotation marks when setting your API keys as environment variables if you're working in a notebook environment.

Using Metrics

There are three ways you can use metrics:

  1. End-to-end evals, treating your LLM system as a black-box and evaluating the system inputs and outputs.
  2. Component-level evals, placing metrics on individual components in your LLM app instead.
  3. One-off (or standalone) evals, where you execute a metric individually.

For End-to-End Evals

To run end-to-end evaluations of your LLM system using any metric of your choice, simply provide a list of test cases to evaluate your metrics against:

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate

test_case = LLMTestCase(input="...", actual_output="...")

evaluate(test_cases=[test_case], metrics=[AnswerRelevancyMetric()])

The evaluate() function and deepeval test run are the best ways to run evaluations. Both offer tons of features out of the box, including caching, parallelization, cost tracking, error handling, and integration with Confident AI.

tip

deepeval test run is deepeval's native Pytest integration, which allows you to run evals in CI/CD pipelines.

For Component-Level Evals

To run component-level evaluations of your LLM system using any metric of your choice, simply decorate your components with @observe and create test cases at runtime:

from deepeval.dataset import EvaluationDataset, Golden
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_span
from deepeval.metrics import AnswerRelevancyMetric

# 1. observe() decorator traces LLM components
@observe()
def llm_app(input: str):
    # 2. Supply metrics at any component
    @observe(metrics=[AnswerRelevancyMetric()])
    def nested_component():
        # 3. Create test case at runtime
        update_current_span(test_case=LLMTestCase(...))

    nested_component()

# 4. Create dataset
dataset = EvaluationDataset(goldens=[Golden(input="Test input")])

# 5. Loop through dataset
for golden in dataset.evals_iterator():
    # Call LLM app
    llm_app(golden.input)

For One-Off Evals

You can also execute each metric individually. All metrics in deepeval, including custom metrics that you create:

  • can be executed via the metric.measure() method
  • can have their score accessed via metric.score, which ranges from 0 - 1
  • can have their score reason accessed via metric.reason
  • can have their status accessed via metric.is_successful()
  • can be used to evaluate test cases or entire datasets, with or without Pytest
  • have a threshold that acts as the cutoff for success; metric.is_successful() is only true if metric.score is above/below the threshold, depending on the metric
  • have a strict_mode property, which when turned on enforces a binary metric.score
  • have a verbose_mode property, which when turned on prints metric logs whenever the metric is executed
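The strict_mode property in particular can be sketched as follows (an illustrative sketch of the binarization described above, not deepeval's internals):

```python
def finalize_score(raw_score: float, threshold: float, strict_mode: bool) -> float:
    # strict_mode collapses the 0-1 score to a binary pass/fail value
    if not strict_mode:
        return raw_score
    return 1.0 if raw_score >= threshold else 0.0

print(finalize_score(0.8, 0.5, strict_mode=False))  # 0.8
print(finalize_score(0.8, 0.5, strict_mode=True))   # 1.0
print(finalize_score(0.3, 0.5, strict_mode=True))   # 0.0
```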

In addition, all metrics in deepeval execute asynchronously by default. You can configure this behavior using the async_mode parameter when instantiating a metric.

tip

Visit an individual metric page to learn how they are calculated, and what is required when creating an LLMTestCase in order to execute it.

Here's a quick example:

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Initialize a test case
test_case = LLMTestCase(...)

# Initialize metric with threshold
metric = AnswerRelevancyMetric(threshold=0.5)
metric.measure(test_case)

print(metric.score, metric.reason)

All of deepeval's metrics give a reason alongside their score.

Using Metrics Async

When a metric's async_mode=True (which is the default for all metrics), invocations of metric.measure() will execute internal algorithms concurrently. However, it's important to note that while operations INSIDE measure() execute concurrently, the metric.measure() call itself still blocks the main thread.

info

Let's take the FaithfulnessMetric algorithm for example:

  1. Extract all factual claims made in the actual_output
  2. Extract all factual truths found in the retrieval_context
  3. Compare extracted claims and truths to generate a final score and reason.

from deepeval.metrics import FaithfulnessMetric
...

metric = FaithfulnessMetric(async_mode=True)
metric.measure(test_case)
print("Metric finished!")

When async_mode=True, steps 1 and 2 execute concurrently (i.e., at the same time) since they are independent of each other, while async_mode=False causes steps 1 and 2 to execute sequentially instead (i.e., one after the other).

In both cases, "Metric finished!" will wait for metric.measure() to finish running before printing, but setting async_mode to True would make the print statement appear earlier, as async_mode=True allows metric.measure() to run faster.
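The scheduling difference can be illustrated with plain asyncio (a sketch of steps 1 and 2 above; the sleeps stand in for LLM calls, and none of this is deepeval's actual code):

```python
import asyncio

events = []

async def extract_claims():
    # Step 1: extract factual claims from the actual_output
    events.append("claims:start")
    await asyncio.sleep(0.01)  # stands in for an LLM call
    events.append("claims:done")

async def extract_truths():
    # Step 2: extract factual truths from the retrieval_context
    events.append("truths:start")
    await asyncio.sleep(0.01)
    events.append("truths:done")

async def measure_concurrently():
    # async_mode=True: both steps are awaited together
    await asyncio.gather(extract_claims(), extract_truths())

asyncio.run(measure_concurrently())
print(events)  # both steps start before either one finishes
```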

To measure multiple metrics at once and NOT block the main thread, use the asynchronous a_measure() method instead.

import asyncio
...

# Remember to use async
async def long_running_function():
    # These will all run at the same time
    await asyncio.gather(
        metric1.a_measure(test_case),
        metric2.a_measure(test_case),
        metric3.a_measure(test_case),
        metric4.a_measure(test_case)
    )
    print("Metrics finished!")

asyncio.run(long_running_function())

Debug A Metric Judgement

You can turn on verbose_mode for ANY deepeval metric at metric initialization to debug a metric whenever the measure() or a_measure() method is called:

...

metric = AnswerRelevancyMetric(verbose_mode=True)
metric.measure(test_case)

note

Turning verbose_mode on will print the inner workings of a metric whenever measure() or a_measure() is called.

Customize Metric Prompts

All of deepeval's metrics use LLM-as-a-judge evaluation with unique default prompt templates for each metric. While deepeval has well-designed algorithms for each metric, you can customize these prompt templates to improve evaluation accuracy and stability. Simply provide a custom template class as the evaluation_template parameter to your metric of choice (example below).

info

For example, in the AnswerRelevancyMetric, you might disagree with what we consider something to be "relevant", but with this capability you can now override any opinions deepeval has in its default evaluation prompts.

You'll find this particularly valuable when using a custom LLM, as deepeval's default metrics are optimized for OpenAI's models, which are generally more powerful than most custom LLMs.

note

This means you can better handle the invalid JSON outputs (along with JSON confinement) that come with weaker models, and provide better in-context learning examples for your custom LLM judges for better metric accuracy.

Here's a quick example of how you can define a custom AnswerRelevancyTemplate and inject it into the AnswerRelevancyMetric through the evaluation_template parameter:

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics.answer_relevancy import AnswerRelevancyTemplate

# Define custom template
class CustomTemplate(AnswerRelevancyTemplate):
    @staticmethod
    def generate_statements(actual_output: str):
        return f"""Given the text, breakdown and generate a list of statements presented.

Example:
Our new laptop model features a high-resolution Retina display for crystal-clear visuals.

{{
"statements": [
"The new laptop model has a high-resolution Retina display."
]
}}
===== END OF EXAMPLE ======

Text:
{actual_output}

JSON:
"""

# Inject custom template to metric
metric = AnswerRelevancyMetric(evaluation_template=CustomTemplate)
metric.measure(...)

tip

You can find examples of how this can be done in more detail on the Customize Your Template section of each individual metric page, which shows code examples, and a link to deepeval's GitHub showing the default templates currently used.

What About Non-LLM-as-a-judge Metrics?

If you're looking to use something like ROUGE, BLEU, or BLEURT, you can create a custom metric and use the scorer module available in deepeval for scoring by following this guide.

The scorer module is available but not documented, because our experience tells us these scorers are not useful as LLM metrics, where outputs require a high level of reasoning to evaluate.
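For flavor, here's what the scoring logic of a self-coded metric might look like: a toy unigram-overlap scorer standing in for BLEU/ROUGE (a hypothetical simplification, not deepeval's scorer module; in practice you'd wrap it in a custom metric class as described in the guide):

```python
def unigram_overlap_score(actual_output: str, expected_output: str) -> float:
    # Toy stand-in for BLEU/ROUGE: fraction of expected tokens found in the output
    expected_tokens = expected_output.lower().split()
    actual_tokens = set(actual_output.lower().split())
    if not expected_tokens:
        return 0.0
    matched = sum(1 for token in expected_tokens if token in actual_tokens)
    return matched / len(expected_tokens)

score = unigram_overlap_score(
    "deepeval is an evaluation framework",
    "deepeval is a framework",
)
print(score)  # 0.75: "a" never appears in the actual output
```

Note how a purely lexical score like this cannot reward a correct paraphrase, which is exactly the reasoning gap described above.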