
Arena Test Case

Quick Summary

An arena test case is a blueprint provided by deepeval for comparing which iteration of your LLM app performs better. It works by comparing each contestant's LLMTestCase, and currently only supports LLMTestCase for single-turn, text-based comparisons.

The ArenaTestCase currently only runs with the ArenaGEval metric, and all that is required is to provide a list of Contestants:

main.py
from deepeval.test_case import ArenaTestCase, LLMTestCase, Contestant

test_case = ArenaTestCase(contestants=[
    Contestant(
        name="GPT-4",
        hyperparameters={"model": "gpt-4"},
        test_case=LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris",
        ),
    ),
    Contestant(
        name="Claude-4",
        hyperparameters={"model": "claude-4"},
        test_case=LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris is the capital of France.",
        ),
    ),
    Contestant(
        name="Gemini-2.5",
        hyperparameters={"model": "gemini-2.5-flash"},
        test_case=LLMTestCase(
            input="What is the capital of France?",
            actual_output="Absolutely! The capital of France is Paris 😊",
        ),
    ),
])

Note that all inputs and expected_outputs you provide across contestants MUST match.
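Whether or not a mismatch is caught for you at construction time, the rule is easy to check yourself. The sketch below illustrates it using minimal stand-in classes (not deepeval's own) and a hypothetical `validate_inputs` helper:

```python
from dataclasses import dataclass

# Minimal stand-ins for deepeval's LLMTestCase / Contestant, for illustration only.
@dataclass
class FakeTestCase:
    input: str
    actual_output: str

@dataclass
class FakeContestant:
    name: str
    test_case: FakeTestCase

def validate_inputs(contestants):
    """Raise if contestants were not all given the same input."""
    inputs = {c.test_case.input for c in contestants}
    if len(inputs) > 1:
        raise ValueError(f"All contestant inputs must match, got: {sorted(inputs)}")

contestants = [
    FakeContestant("GPT-4", FakeTestCase("What is the capital of France?", "Paris")),
    FakeContestant("Claude-4", FakeTestCase("What is the capital of France?", "Paris is the capital of France.")),
]
validate_inputs(contestants)  # passes: both contestants share the same input
```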

Arena Test Case

The ArenaTestCase takes a simple contestants argument, which is a list of Contestants.

contestant_1 = Contestant(
    name="GPT-4",
    hyperparameters={"model": "gpt-4"},
    test_case=LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris",
    ),
)

contestant_2 = Contestant(
    name="Claude-4",
    hyperparameters={"model": "claude-4"},
    test_case=LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
    ),
)

contestant_3 = Contestant(
    name="Gemini-2.5",
    hyperparameters={"model": "gemini-2.5-flash"},
    test_case=LLMTestCase(
        input="What is the capital of France?",
        actual_output="Absolutely! The capital of France is Paris 😊",
    ),
)

test_case = ArenaTestCase(contestants=[contestant_1, contestant_2, contestant_3])

Contestant

A Contestant represents a single unit of LLM interaction from a specific version of your LLM app. It accepts a test_case, a name to identify the LLM app version that generated the test case, and optionally any hyperparameters associated with that version.

from deepeval.test_case import Contestant, LLMTestCase
from deepeval.prompt import Prompt

contestant_1 = Contestant(
    name="GPT-4",
    test_case=LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris",
    ),
    hyperparameters={
        "model": "gpt-4",
        "prompt": Prompt(alias="test_prompt", text_template="You are a helpful assistant."),
    },
)

Including Images

deepeval supports passing both text and images inside your test cases using the MLLMImage object, which lets you build test cases from local images, remote URLs, and base64 data.

from deepeval.test_case import ArenaTestCase, LLMTestCase, Contestant, MLLMImage

shoes = MLLMImage(url='./shoes.png', local=True)

test_case = ArenaTestCase(contestants=[
    Contestant(
        name="GPT-4",
        hyperparameters={"model": "gpt-4"},
        test_case=LLMTestCase(
            input=f"What's in this image? {shoes}",
            actual_output="That's a red shoe",
        ),
    ),
    Contestant(
        name="Claude-4",
        hyperparameters={"model": "claude-4"},
        test_case=LLMTestCase(
            input=f"What's in this image? {shoes}",
            actual_output="The image shows a pair of red shoes",
        ),
    )
])

MLLMImage Data Model

Here's the data model of the MLLMImage in deepeval:

class MLLMImage:
    dataBase64: Optional[str] = None
    mimeType: Optional[str] = None
    url: Optional[str] = None
    local: Optional[bool] = None
    filename: Optional[str] = None

You MUST provide either url, or both dataBase64 and mimeType, when initializing an MLLMImage. The local attribute should be set to True for locally stored images and False for images hosted online (defaults to False).
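For the base64 path, the dataBase64 string can be produced with Python's standard base64 module. Inline bytes are used below so the snippet is self-contained; in practice you would read them from the image file:

```python
import base64

# Encode raw image bytes into the string expected by `dataBase64`.
# Inline bytes keep this example self-contained; in practice use:
# png_bytes = open("shoes.png", "rb").read()
png_bytes = b"\x89PNG\r\n\x1a\n"  # the 8-byte PNG file signature
encoded = base64.b64encode(png_bytes).decode("utf-8")

# Pass `dataBase64` together with `mimeType`:
# image = MLLMImage(dataBase64=encoded, mimeType="image/png")
print(encoded)  # iVBORw0KGgo=
```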

Using Test Cases For Evals

The ArenaGEval metric is the only metric that uses an ArenaTestCase, which picks a "winner" out of the list of contestants:

from deepeval import compare
from deepeval.metrics import ArenaGEval
from deepeval.test_case import SingleTurnParams
...

arena_geval = ArenaGEval(
    name="Friendly",
    criteria="Choose the more friendly contestant as the winner, based on the input and actual output",
    evaluation_params=[
        SingleTurnParams.INPUT,
        SingleTurnParams.ACTUAL_OUTPUT,
    ],
)

compare(test_cases=[test_case], metric=arena_geval)

The ArenaTestCase streamlines the evaluation by automatically masking contestant names (to ensure unbiased judging) and randomizing their order.
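Conceptually, the masking and randomization work like the sketch below. This is an illustration of the idea, not deepeval's internal code:

```python
import random

# Illustration only (not deepeval's internals): hide contestant names behind
# neutral aliases and shuffle their order, so the judge model can't favor a
# contestant by name or by position in the prompt.
names = ["GPT-4", "Claude-4", "Gemini-2.5"]
shuffled = random.sample(names, k=len(names))
aliases = {f"Contestant {i + 1}": name for i, name in enumerate(shuffled)}

# The judge only ever sees the aliases; after judging, the winning alias
# is mapped back to the real contestant name.
winner_alias = "Contestant 1"  # pretend the judge picked this one
print(aliases[winner_alias])
```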
