
Arena Test Case

Quick Summary

An arena test case is a blueprint provided by deepeval for comparing which iteration of your LLM app performed better. It compares the LLMTestCase of each contestant against the others, and currently supports only the LLMTestCase, for single-turn, text-based comparisons.

The ArenaTestCase currently only runs with the ArenaGEval metric, and all that is required is to provide a list of Contestants:

main.py
from deepeval.test_case import ArenaTestCase, LLMTestCase, Contestant

test_case = ArenaTestCase(contestants=[
    Contestant(
        name="GPT-4",
        hyperparameters={"model": "gpt-4"},
        test_case=LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris",
        ),
    ),
    Contestant(
        name="Claude-4",
        hyperparameters={"model": "claude-4"},
        test_case=LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris is the capital of France.",
        ),
    ),
    Contestant(
        name="Gemini-2.5",
        hyperparameters={"model": "gemini-2.5-flash"},
        test_case=LLMTestCase(
            input="What is the capital of France?",
            actual_output="Absolutely! The capital of France is Paris 😊",
        ),
    ),
])

Note that all inputs and expected_outputs you provide across contestants MUST match.
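One way to catch mismatches early is to validate the contestants before building the ArenaTestCase. The sketch below uses plain dictionaries as stand-ins for contestants, and assert_same_inputs is a hypothetical helper rather than part of deepeval; it only illustrates the matching rule stated above.

```python
from typing import Dict, List, Optional


def assert_same_inputs(contestants: List[Dict[str, Optional[str]]]) -> None:
    """Raise ValueError if contestants disagree on input or expected_output."""
    for field in ("input", "expected_output"):
        # Collect the distinct values of this field across all contestants.
        values = {c.get(field) for c in contestants}
        if len(values) > 1:
            raise ValueError(
                f"Contestants disagree on {field!r}: {sorted(map(repr, values))}"
            )


# Stand-in data: dictionaries, not deepeval Contestant objects.
contestants = [
    {
        "name": "GPT-4",
        "input": "What is the capital of France?",
        "actual_output": "Paris",
    },
    {
        "name": "Claude-4",
        "input": "What is the capital of France?",
        "actual_output": "Paris is the capital of France.",
    },
]

assert_same_inputs(contestants)  # passes silently: the inputs are identical
```

With real Contestant objects, the same check could presumably read the fields from each contestant's test_case instead of from a dictionary.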

Arena Test Case

The ArenaTestCase takes a simple contestants argument, which is a list of Contestants.

contestant_1 = Contestant(
    name="GPT-4",
    hyperparameters={"model": "gpt-4"},
    test_case=LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris",
    ),
)

contestant_2 = Contestant(
    name="Claude-4",
    hyperparameters={"model": "claude-4"},
    test_case=LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
    ),
)

contestant_3 = Contestant(
    name="Gemini-2.5",
    hyperparameters={"model": "gemini-2.5-flash"},
    test_case=LLMTestCase(
        input="What is the capital of France?",
        actual_output="Absolutely! The capital of France is Paris 😊",
    ),
)

test_case = ArenaTestCase(contestants=[contestant_1, contestant_2, contestant_3])

Contestant

A Contestant represents a single unit of LLM interaction from a specific version of your LLM app. It accepts a test_case, a name identifying the LLM app version that generated the test case, and optionally any hyperparameters associated with that version.

from deepeval.test_case import Contestant, LLMTestCase
from deepeval.prompt import Prompt

contestant_1 = Contestant(
    name="GPT-4",
    test_case=LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris",
    ),
    hyperparameters={
        "model": "gpt-4",
        "prompt": Prompt(alias="test_prompt", text_template="You are a helpful assistant."),
    },
)

Including Images

By default, deepeval supports passing both text and images inside your test cases using the MLLMImage object. An MLLMImage references an image in a test case, and can be created from a local file, a remote URL, or base64 data.

from deepeval.test_case import ArenaTestCase, LLMTestCase, Contestant, MLLMImage

shoes = MLLMImage(url='./shoes.png', local=True)

test_case = ArenaTestCase(contestants=[
    Contestant(
        name="GPT-4",
        hyperparameters={"model": "gpt-4"},
        test_case=LLMTestCase(
            input=f"What's in this image? {shoes}",
            actual_output="That's a red shoe",
        ),
    ),
    Contestant(
        name="Claude-4",
        hyperparameters={"model": "claude-4"},
        test_case=LLMTestCase(
            input=f"What's in this image? {shoes}",
            actual_output="The image shows a pair of red shoes",
        ),
    )
])

MLLMImage Data Model

Here's the data model of the MLLMImage in deepeval:

class MLLMImage:
    dataBase64: Optional[str] = None
    mimeType: Optional[str] = None
    url: Optional[str] = None
    local: Optional[bool] = None
    filename: Optional[str] = None

When initializing an MLLMImage, you MUST provide either url, or both dataBase64 and mimeType. Set local to True for locally stored images and to False for images hosted online (the default is False).
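For the base64 route, the two values can be produced with the Python standard library. This is a minimal sketch: the image bytes are a placeholder, and the final MLLMImage call is shown only as a comment, assuming the constructor accepts the field names listed in the data model above.

```python
import base64
import mimetypes

# Placeholder bytes standing in for the contents of a real image file.
# In practice you would read them with open("shoes.png", "rb").read().
image_bytes = b"\x89PNG\r\n\x1a\n"
filename = "shoes.png"

# Encode the raw bytes as an ASCII base64 string.
data_base64 = base64.b64encode(image_bytes).decode("ascii")

# Guess the MIME type from the file extension.
mime_type, _ = mimetypes.guess_type(filename)

print(mime_type)
print(data_base64)

# Assumed usage, based on the data model above:
# image = MLLMImage(dataBase64=data_base64, mimeType=mime_type)
```

Guessing the MIME type from the extension is only as reliable as the filename, so pass it explicitly if your files are not consistently named.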

Using Test Cases For Evals

ArenaGEval is the only metric that uses an ArenaTestCase. It picks a "winner" out of the list of contestants:

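from deepeval import compare
from deepeval.metrics import ArenaGEval
from deepeval.test_case import SingleTurnParams  # assumed location; adjust to your deepeval version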
...

arena_geval = ArenaGEval(
    name="Friendly",
    criteria="Choose the winner of the more friendly contestant based on the input and actual output",
    evaluation_params=[
        SingleTurnParams.INPUT,
        SingleTurnParams.ACTUAL_OUTPUT,
    ],
)

compare(test_cases=[test_case], metric=arena_geval)

The ArenaTestCase streamlines the evaluation by automatically masking contestant names (to ensure unbiased judging) and randomizing their order.
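To make the masking and shuffling idea concrete, here is an illustrative sketch using only the standard library. It is not deepeval's implementation: mask_and_shuffle and the "Contestant A" alias format are hypothetical, chosen only to show how a judge can see neutral labels while the real names are recovered afterwards.

```python
import random
import string
from typing import Dict, List, Optional, Tuple


def mask_and_shuffle(
    names: List[str], seed: Optional[int] = None
) -> Tuple[List[str], Dict[str, str]]:
    """Return neutral aliases plus a mapping from each alias to a real name.

    Supports up to 26 contestants, one per uppercase letter.
    """
    rng = random.Random(seed)
    shuffled = names[:]  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)  # randomize the order the judge would see
    aliases = [
        f"Contestant {string.ascii_uppercase[i]}" for i in range(len(shuffled))
    ]
    alias_to_name = dict(zip(aliases, shuffled))
    return aliases, alias_to_name


aliases, alias_to_name = mask_and_shuffle(
    ["GPT-4", "Claude-4", "Gemini-2.5"], seed=0
)

# The judge would only ever see the aliases; suppose it picks the first one.
judged_winner = aliases[0]

# Unmask the winner once judging is done.
print(alias_to_name[judged_winner])
```

Keeping the alias mapping outside the judge's prompt is what prevents a model name from influencing the verdict, and shuffling guards against a preference for whichever answer is listed first.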

FAQs

What is an ArenaTestCase used for?
It compares multiple candidates (contestants) for the same input and picks a winner, instead of scoring a single output in isolation. It's ideal for choosing between models, prompts, or configurations head to head.

Which metric works with an ArenaTestCase?
The ArenaGEval metric is the only metric that consumes an ArenaTestCase. It uses your criteria to pick the winning contestant via compare().

How does deepeval prevent bias toward a particular contestant?
The ArenaTestCase automatically masks contestant names and randomizes their order before judging, so the judge can't be influenced by naming or position.

What's the difference between arena and regular single-turn evals?
A regular LLMTestCase scores one output against a metric threshold (absolute scoring). An arena test case is relative: it asks which candidate is best for the same input rather than whether a single output passes.

Can arena test cases include images?
Yes. Each contestant can include MLLMImage objects, so you can run head-to-head comparisons on multimodal outputs.

Can my team run these comparisons on the cloud and visualize the winners in a UI?
Arena comparisons run locally out of the box. If your team wants a shared, cloud-hosted view of which candidate won and why, Confident AI (built by the deepeval team) can visualize comparison results in a UI for collaborative review. This is optional, and your compare() runs work the same without it.
