Summarization
The summarization metric uses LLM-as-a-judge to determine whether your LLM (application) is generating factually correct summaries while including the necessary details from the original text. In a summarization task within deepeval, the original text refers to the input while the summary is the actual_output.
Required Arguments
To use the SummarizationMetric, you'll have to provide the following arguments when creating an LLMTestCase:
- input
- actual_output
Read the How Is It Calculated section below to learn how test case parameters are used for metric calculation.
Usage
Let's take this input and actual_output as an example:
# This is the original text to be summarized
input = """
The 'coverage score' is calculated as the percentage of assessment questions
for which both the summary and the original document provide a 'yes' answer. This
method ensures that the summary not only includes key information from the original
text but also accurately represents it. A higher coverage score indicates a
more comprehensive and faithful summary, signifying that the summary effectively
encapsulates the crucial points and details from the original content.
"""
# This is the summary, replace this with the actual output from your LLM application
actual_output="""
The coverage score quantifies how well a summary captures and
accurately represents key information from the original text,
with a higher score indicating greater comprehensiveness.
"""You can use the SummarizationMetric as follows for end-to-end evaluation:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import SummarizationMetric
...
test_case = LLMTestCase(input=input, actual_output=actual_output)
metric = SummarizationMetric(
threshold=0.5,
model="gpt-4",
assessment_questions=[
"Is the coverage score based on a percentage of 'yes' answers?",
"Does the score ensure the summary's accuracy with the source?",
"Does a higher score mean a more comprehensive summary?"
]
)
# To run metric as a standalone
# metric.measure(test_case)
# print(metric.score, metric.reason)
evaluate(test_cases=[test_case], metrics=[metric])

There are NINE optional parameters when instantiating a SummarizationMetric class:
- [Optional] threshold: the passing threshold, defaulted to 0.5.
- [Optional] assessment_questions: a list of close-ended questions that can be answered with either a 'yes' or a 'no'. These are questions you ideally want your summary to be able to answer, which is especially helpful if you already know what a good summary for your use case looks like. If assessment_questions is not provided, we will generate a set of assessment_questions for you at evaluation time. The assessment_questions are used to calculate the coverage_score.
- [Optional] n: the number of assessment questions to generate when assessment_questions is not provided. Defaulted to 5.
- [Optional] model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to gpt-5.4.
- [Optional] include_reason: a boolean which when set to True, will include a reason for its evaluation score. Defaulted to True.
- [Optional] strict_mode: a boolean which when set to True, enforces a strict evaluation criterion. In strict mode, the metric score becomes binary: a score of 1 indicates a perfect result, and any outcome less than perfect is scored as 0. Defaulted to False.
- [Optional] async_mode: a boolean which when set to True, enables concurrent execution within the measure() method. Defaulted to True.
- [Optional] verbose_mode: a boolean which when set to True, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to False.
- [Optional] truths_extraction_limit: an int which when set, determines the maximum number of factual truths to extract from the input. The truths extracted will be used to determine the alignment_score, and will be ordered by importance, as decided by your evaluation model. Defaulted to None.
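Here is a sketch that combines several of the optional parameters above in a single instantiation; the values shown are purely illustrative, not recommendations:

from deepeval.metrics import SummarizationMetric
metric = SummarizationMetric(
    threshold=0.6,
    model="gpt-4",                 # or any custom model of type DeepEvalBaseLLM
    n=3,                           # only used when assessment_questions is not provided
    include_reason=True,
    strict_mode=False,
    async_mode=True,
    verbose_mode=True,             # prints intermediate steps to the console
    truths_extraction_limit=20,    # cap the number of truths extracted from the input
)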
Within components
You can also run the SummarizationMetric within nested components for component-level evaluation.
from deepeval.dataset import Golden
from deepeval.tracing import observe, update_current_span
...
@observe(metrics=[metric])
def inner_component():
# Set test case at runtime
test_case = LLMTestCase(input="...", actual_output="...")
update_current_span(test_case=test_case)
return
@observe
def llm_app(input: str):
# Component can be anything from an LLM call, retrieval, agent, tool use, etc.
inner_component()
return
evaluate(observed_callback=llm_app, goldens=[Golden(input="Hi!")])

As a standalone
You can also run the SummarizationMetric on a single test case as a standalone, one-off execution.
...
metric.measure(test_case)
print(metric.score, metric.reason)

How Is It Calculated?
The SummarizationMetric score is calculated according to the following equation:

summarization score = min(alignment_score, coverage_score)

To break it down:
- The alignment_score determines whether the summary contains hallucinated or contradictory information relative to the original text.
- The coverage_score determines whether the summary contains the necessary information from the original text.
While the alignment_score is similar to that of the HallucinationMetric, the coverage_score is first calculated by generating n closed-ended questions that can only be answered with either a 'yes' or a 'no', before calculating the proportion of questions for which the original text and the summary yield the same answer. Here is a great article on how deepeval's summarization metric was built.
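For intuition, here is a simplified sketch (not deepeval's implementation) of how a coverage score can be derived from closed-ended assessment questions. It assumes you supply a judge callable that answers 'yes' or 'no' for a given question and text:

from typing import Callable

def coverage_score(
    questions: list[str],
    original_text: str,
    summary: str,
    answer_yes_no: Callable[[str, str], str],  # hypothetical judge: (question, text) -> "yes" or "no"
) -> float:
    # Fraction of assessment questions that both the original text and the summary answer 'yes' to
    if not questions:
        return 0.0
    agreements = sum(
        1
        for question in questions
        if answer_yes_no(question, original_text) == "yes"
        and answer_yes_no(question, summary) == "yes"
    )
    return agreements / len(questions)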
You can access the alignment_score and coverage_score from a SummarizationMetric as follows:
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase
...
test_case = LLMTestCase(...)
metric = SummarizationMetric(...)
metric.measure(test_case)
print(metric.score)
print(metric.reason)
print(metric.score_breakdown)
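Because the final score is the minimum of the two sub-scores, score_breakdown tells you which one is holding the result back. A minimal sketch (the key names below are assumptions; inspect metric.score_breakdown for your version of deepeval):

breakdown = metric.score_breakdown  # e.g. {"Alignment": 0.9, "Coverage": 0.6} (keys assumed)
limiting = min(breakdown, key=breakdown.get)
print(f"Lowest sub-score: {limiting} ({breakdown[limiting]})")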