Arena Test Case
Quick Summary
An arena test case is a blueprint provided by deepeval for comparing which iteration of your LLM app performed better. It works by comparing each contestant's LLMTestCase, and currently only supports the LLMTestCase for single-turn comparisons.
Support for ConversationalTestCase is coming soon.
The ArenaTestCase currently only runs with the ArenaGEval metric, and all that is required is to provide a list of Contestants:
from deepeval.test_case import ArenaTestCase, LLMTestCase, Contestant
test_case = ArenaTestCase(contestants=[
    Contestant(
        name="GPT-4",
        hyperparameters={"model": "gpt-4"},
        test_case=LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris",
        ),
    ),
    Contestant(
        name="Claude-4",
        hyperparameters={"model": "claude-4"},
        test_case=LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris is the capital of France.",
        ),
    ),
    Contestant(
        name="Gemini-2.5",
        hyperparameters={"model": "gemini-2.5-flash"},
        test_case=LLMTestCase(
            input="What is the capital of France?",
            actual_output="Absolutely! The capital of France is Paris 😊",
        ),
    ),
])
Note that all inputs and expected_outputs you provide across contestants MUST match.
For those wondering why we chose to keep the duplicated inputs inside each LLMTestCase instead of moving them up to the ArenaTestCase class: an LLMTestCase integrates nicely with the existing ecosystem.
You also shouldn't worry about accidental mismatches, because deepeval will throw an error if the inputs or expected_outputs don't match.
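As a rough illustration only (the exact exception type and message, and whether the check fires when the ArenaTestCase is constructed or when it's evaluated, are deepeval implementation details), a mismatch would look something like this:
from deepeval.test_case import ArenaTestCase, LLMTestCase, Contestant

# These two contestants answer *different* inputs, which deepeval rejects
mismatched_contestants = [
    Contestant(
        name="GPT-4",
        test_case=LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris",
        ),
    ),
    Contestant(
        name="Claude-4",
        test_case=LLMTestCase(
            input="What is the capital of Germany?",
            actual_output="Berlin",
        ),
    ),
]

try:
    ArenaTestCase(contestants=mismatched_contestants)  # may also surface later, at evaluation time
except Exception as error:  # exact exception type is up to deepeval
    print(f"Rejected: {error}")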
Arena Test Case
The ArenaTestCase takes a simple contestants argument, which is a list of Contestants.
from deepeval.test_case import ArenaTestCase, Contestant, LLMTestCase

contestant_1 = Contestant(
    name="GPT-4",
    hyperparameters={"model": "gpt-4"},
    test_case=LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris",
    ),
)
contestant_2 = Contestant(
    name="Claude-4",
    hyperparameters={"model": "claude-4"},
    test_case=LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
    ),
)
contestant_3 = Contestant(
    name="Gemini-2.5",
    hyperparameters={"model": "gemini-2.5-flash"},
    test_case=LLMTestCase(
        input="What is the capital of France?",
        actual_output="Absolutely! The capital of France is Paris 😊",
    ),
)

test_case = ArenaTestCase(contestants=[contestant_1, contestant_2, contestant_3])
Contestant
A Contestant represents a single unit of LLM interaction from a specific version of your LLM app. It accepts a test_case, a name to identify the LLM app version that generated the test case, and optionally any hyperparameters associated with that version.
from deepeval.test_case import Contestant, LLMTestCase
from deepeval.prompt import Prompt
contestant_1 = Contestant(
    name="GPT-4",
    test_case=LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris",
    ),
    hyperparameters={
        "model": "gpt-4",
        "prompt": Prompt(alias="test_prompt", text_template="You are a helpful assistant."),
    },
)
Including Images
deepeval supports passing both text and images inside your test cases using the MLLMImage object. The MLLMImage class is used to reference images in your test cases, and lets you build test cases from local images, remote URLs, and base64 data.
from deepeval.test_case import ArenaTestCase, LLMTestCase, Contestant, MLLMImage
shoes = MLLMImage(url='./shoes.png', local=True)
test_case = ArenaTestCase(contestants=[
    Contestant(
        name="GPT-4",
        hyperparameters={"model": "gpt-4"},
        test_case=LLMTestCase(
            input=f"What's in this image? {shoes}",
            actual_output="That's a red shoe",
        ),
    ),
    Contestant(
        name="Claude-4",
        hyperparameters={"model": "claude-4"},
        test_case=LLMTestCase(
            input=f"What's in this image? {shoes}",
            actual_output="The image shows a pair of red shoes",
        ),
    ),
])
Multimodal test cases are automatically detected when you include MLLMImage objects in the inputs or outputs of your LLMTestCases. You can use the ArenaGEval metric to run evaluations on multimodal test cases as usual.
MLLMImage Data Model
Here's the data model of the MLLMImage in deepeval:
class MLLMImage:
    dataBase64: Optional[str] = None
    mimeType: Optional[str] = None
    url: Optional[str] = None
    local: Optional[bool] = None
    filename: Optional[str] = None
You MUST provide either the url parameter or both the dataBase64 and mimeType parameters when initializing an MLLMImage. The local attribute should be set to True for locally stored images and False for images hosted online (the default is False).
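For example, here's a quick sketch of the different ways to construct an MLLMImage, assuming the data-model fields above double as constructor arguments (as url and local do in the examples on this page):
import base64

from deepeval.test_case import MLLMImage

# Remote image -- `local` defaults to False
remote_image = MLLMImage(url="https://example.com/shoes.png")

# Local image -- set `local=True` so the file is read from disk
local_image = MLLMImage(url="./shoes.png", local=True)

# Raw image data -- provide both `dataBase64` and `mimeType`
with open("./shoes.png", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")
base64_image = MLLMImage(dataBase64=encoded, mimeType="image/png")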
All MLLMImage instances are converted to a special deepeval slug (e.g. [DEEPEVAL:IMAGE:uuid]). This is how your MLLMImage objects look in your test cases after you embed them in f-strings:
from deepeval.test_case import LLMTestCase, MLLMImage
shoes = MLLMImage(url='./shoes.png', local=True)
test_case = LLMTestCase(
    input=f"Change the color of these shoes to blue: {shoes}",
    expected_output=f"..."
)
print(test_case.input)
This outputs the following:
Change the color of these shoes to blue: [DEEPEVAL:IMAGE:awefv234fvbnhg456]
If you'd like to access the images yourself for any ETL, you can use the convert_to_multi_modal_array function to convert your test case fields into an ordered list of strings and MLLMImage objects. Here's how to use it:
from deepeval.test_case import LLMTestCase, MLLMImage
from deepeval.utils import convert_to_multi_modal_array
shoes = MLLMImage(url='./shoes.png', local=True)
test_case = LLMTestCase(
    input=f"Change the color of these shoes to blue: {shoes}",
    expected_output=f"..."
)
print(convert_to_multi_modal_array(test_case.input))
This will output the following:
["Change the color of these shoes to blue:", [DEEPEVAL:IMAGE:awefv234fvbnhg456]]
The [DEEPEVAL:IMAGE:awefv234fvbnhg456] here is actually the instance of MLLMImage you passed inside your test case.
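As a minimal sketch building on the snippet above, you can split the converted array back into its text and image parts with a simple isinstance check:
from deepeval.test_case import LLMTestCase, MLLMImage
from deepeval.utils import convert_to_multi_modal_array

shoes = MLLMImage(url='./shoes.png', local=True)
test_case = LLMTestCase(
    input=f"Change the color of these shoes to blue: {shoes}",
    expected_output=f"..."
)

# Separate the text segments from the image objects for downstream ETL
parts = convert_to_multi_modal_array(test_case.input)
texts = [part for part in parts if isinstance(part, str)]
images = [part for part in parts if isinstance(part, MLLMImage)]  # the same instances you embedded above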
Using Test Cases For Evals
The ArenaGEval metric is the only metric that uses an ArenaTestCase; it picks a "winner" out of the list of contestants:
from deepeval.metrics import ArenaGEval
from deepeval.test_case import LLMTestCaseParams
from deepeval import compare
...
arena_geval = ArenaGEval(
    name="Friendly",
    criteria="Choose the winner of the more friendly contestant based on the input and actual output",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)
compare(test_cases=[test_case], metric=arena_geval)
The ArenaTestCase streamlines the evaluation by automatically masking contestant names (to ensure unbiased judging) and randomizing their order.