
Image Reference

LLM-as-a-judge · Custom · Single-turn · Multimodal

The Image Reference metric evaluates how accurately images are referenced or explained by their accompanying text. deepeval's Image Reference metric is a self-explaining MLLM-Eval, meaning it outputs a rationale for its assigned score.

Required Arguments

To use the ImageReferenceMetric, you'll have to provide the following arguments when creating an LLMTestCase:

  • input
  • actual_output

The input and actual_output are required to create an LLMTestCase (and hence required by all metrics) even though they might not be used for metric calculation. Read the How Is It Calculated section below to learn more.

Usage

from deepeval.test_case import LLMTestCase, MLLMImage
from deepeval.metrics import ImageReferenceMetric
from deepeval import evaluate

metric = ImageReferenceMetric(
    threshold=0.7,
    include_reason=True,
)
m_test_case = LLMTestCase(
    input=f"Provide step-by-step instructions on how to fold a paper airplane.",
    # Replace with your MLLM app output
    actual_output=f"""
        1. Take the sheet of paper and fold it lengthwise:
        {MLLMImage(url="./paper_plane_1", local=True)}
        2. Unfold the paper. Fold the top left and right corners towards the center.
        {MLLMImage(url="./paper_plane_2", local=True)}
        ...
    """
)

evaluate(test_cases=[m_test_case], metrics=[metric])

There are SIX optional parameters when creating an ImageReferenceMetric:

  • [Optional] threshold: a float representing the minimum passing threshold, defaulted to 0.5.
  • [Optional] include_reason: a boolean which when set to True, includes a reason for its evaluation score. Defaulted to True.
  • [Optional] strict_mode: a boolean which when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.
  • [Optional] async_mode: a boolean which when set to True, enables concurrent execution within the measure() method. Defaulted to True.
  • [Optional] verbose_mode: a boolean which when set to True, prints the intermediate steps used to calculate the metric to the console, as outlined in the How Is It Calculated section. Defaulted to False.
  • [Optional] max_context_size: an integer representing the maximum number of characters in each context, as outlined in the How Is It Calculated section. Defaulted to None.
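For reference, here is a sketch that sets every optional parameter explicitly. The values are illustrative, not recommendations:

from deepeval.metrics import ImageReferenceMetric

metric = ImageReferenceMetric(
    threshold=0.7,          # minimum passing score (default: 0.5)
    include_reason=True,    # attach a rationale to the score (default: True)
    strict_mode=False,      # binary 0/1 scoring when True (default: False)
    async_mode=True,        # run evaluation concurrently in measure() (default: True)
    verbose_mode=False,     # print intermediate steps to the console (default: False)
    max_context_size=2000,  # cap each image's surrounding context at 2000 characters
)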

As a standalone

You can also run the ImageReferenceMetric on a single test case as a standalone, one-off execution.

...

metric.measure(m_test_case)
print(metric.score, metric.reason)
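
Since deepeval metrics also expose an awaitable a_measure() counterpart to measure(), you can run the same one-off evaluation inside your own event loop. A minimal sketch, reusing metric and m_test_case from above:

import asyncio

async def main():
    # a_measure() is the awaitable counterpart to measure()
    await metric.a_measure(m_test_case)
    print(metric.score, metric.reason)

asyncio.run(main())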

How Is It Calculated?

The ImageReference score is calculated as follows:

  1. Individual Image Reference: Each image's reference score is based on the text directly above and below the image, limited to max_context_size characters on each side. If max_context_size is not supplied, all available text is used. The equation can be expressed as:

R_i = f(\text{Context}_{\text{above}}, \text{Context}_{\text{below}}, \text{Image}_i)

  2. Final Score: The overall ImageReference score is the average of the individual image reference scores across all n images:

O = \frac{\sum_{i=1}^{n} R_i}{n}
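
As a sketch of the final averaging step (the per-image scores below are hypothetical judge outputs, not values produced by deepeval):

def final_score(per_image_scores: list[float]) -> float:
    # O = (sum of R_i) / n, the average of the individual reference scores
    return sum(per_image_scores) / len(per_image_scores)

# Two images judged at R_1 = 0.9 and R_2 = 0.7
print(final_score([0.9, 0.7]))  # 0.8, which passes the 0.7 threshold used above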
