Arena Test Case
Quick Summary
An arena test case is a blueprint provided by deepeval for comparing which iteration of your LLM app performed better. It works by comparing each contestant's LLMTestCase, and currently only supports the LLMTestCase for single-turn, text-based comparisons.
The ArenaTestCase currently only runs with the ArenaGEval metric, and all that is required is to provide a list of Contestants:
from deepeval.test_case import ArenaTestCase, LLMTestCase, Contestant
test_case = ArenaTestCase(contestants=[
    Contestant(
        name="GPT-4",
        hyperparameters={"model": "gpt-4"},
        test_case=LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris",
        ),
    ),
    Contestant(
        name="Claude-4",
        hyperparameters={"model": "claude-4"},
        test_case=LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris is the capital of France.",
        ),
    ),
    Contestant(
        name="Gemini-2.5",
        hyperparameters={"model": "gemini-2.5-flash"},
        test_case=LLMTestCase(
            input="What is the capital of France?",
            actual_output="Absolutely! The capital of France is Paris 😊",
        ),
    ),
])

Note that all inputs and expected_outputs you provide across contestants MUST match.
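Because every contestant must answer the same prompt, it can help to validate your contestants before constructing the ArenaTestCase. The helper below is a hypothetical sketch (not part of deepeval's API) that works on any contestant-like object with a test_case.input attribute:

```python
def assert_matching_inputs(contestants):
    """Hypothetical helper -- not part of deepeval.

    Raises ValueError if the contestants' test cases do not all
    share the same input, which ArenaTestCase requires.
    """
    inputs = {c.test_case.input for c in contestants}
    if len(inputs) > 1:
        raise ValueError(f"Contestant inputs differ: {inputs}")
```

Running this check before building the test case surfaces mismatches early, with a clearer error message than failing later at evaluation time.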
Arena Test Case
The ArenaTestCase takes a simple contestants argument, which is a list of Contestants.
from deepeval.test_case import ArenaTestCase, LLMTestCase, Contestant

contestant_1 = Contestant(
    name="GPT-4",
    hyperparameters={"model": "gpt-4"},
    test_case=LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris",
    ),
)
contestant_2 = Contestant(
    name="Claude-4",
    hyperparameters={"model": "claude-4"},
    test_case=LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
    ),
)
contestant_3 = Contestant(
    name="Gemini-2.5",
    hyperparameters={"model": "gemini-2.5-flash"},
    test_case=LLMTestCase(
        input="What is the capital of France?",
        actual_output="Absolutely! The capital of France is Paris 😊",
    ),
)

test_case = ArenaTestCase(contestants=[contestant_1, contestant_2, contestant_3])

Contestant
A Contestant represents a single unit of LLM interaction from a specific version of your LLM app. It accepts a test_case, a name identifying the LLM app version that generated the test case, and optionally any hyperparameters associated with that version.
from deepeval.test_case import Contestant, LLMTestCase
from deepeval.prompt import Prompt
contestant_1 = Contestant(
    name="GPT-4",
    test_case=LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris",
    ),
    hyperparameters={
        "model": "gpt-4",
        "prompt": Prompt(alias="test_prompt", text_template="You are a helpful assistant."),
    },
)

Including Images
By default, deepeval supports passing both text and images inside your test cases using the MLLMImage object. The MLLMImage class references multimodal images in your test cases, and lets you create test cases from local images, remote URLs, and base64 data.
from deepeval.test_case import ArenaTestCase, LLMTestCase, Contestant, MLLMImage
shoes = MLLMImage(url='./shoes.png', local=True)

test_case = ArenaTestCase(contestants=[
    Contestant(
        name="GPT-4",
        hyperparameters={"model": "gpt-4"},
        test_case=LLMTestCase(
            input=f"What's in this image? {shoes}",
            actual_output="That's a red shoe",
        ),
    ),
    Contestant(
        name="Claude-4",
        hyperparameters={"model": "claude-4"},
        test_case=LLMTestCase(
            input=f"What's in this image? {shoes}",
            actual_output="The image shows a pair of red shoes",
        ),
    ),
])

MLLMImage Data Model
Here's the data model of the MLLMImage in deepeval:
class MLLMImage:
    dataBase64: Optional[str] = None
    mimeType: Optional[str] = None
    url: Optional[str] = None
    local: Optional[bool] = None
    filename: Optional[str] = None

You MUST provide either url, or both dataBase64 and mimeType, when initializing an MLLMImage. Set local to True for locally stored images and False for images hosted online (the default is False).
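If you take the dataBase64 route, you can produce both required values from a local file with the standard library. A minimal sketch (the file path and helper name are illustrative, not part of deepeval):

```python
import base64
import mimetypes

def image_to_base64(path):
    """Read an image file and return (base64_data, mime_type).

    The pair maps onto MLLMImage's dataBase64 and mimeType fields.
    """
    mime_type, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("ascii")
    return data, mime_type

# Hypothetical usage, assuming ./shoes.png exists:
# data, mime = image_to_base64("./shoes.png")
# image = MLLMImage(dataBase64=data, mimeType=mime)
```

mimetypes.guess_type infers the MIME type from the file extension, so prefer files with standard extensions like .png or .jpg.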
Using Test Cases For Evals
The ArenaGEval metric is the only metric that uses an ArenaTestCase, which picks a "winner" out of the list of contestants:
from deepeval.metrics import ArenaGEval, SingleTurnParams
...

arena_geval = ArenaGEval(
    name="Friendly",
    criteria="Choose the winner as the friendlier contestant based on the input and actual output",
    evaluation_params=[
        SingleTurnParams.INPUT,
        SingleTurnParams.ACTUAL_OUTPUT,
    ],
)

compare(test_cases=[test_case], metric=arena_geval)

The ArenaTestCase streamlines evaluation by automatically masking contestant names (to ensure unbiased judging) and randomizing their order.
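Conceptually, the masking and randomization step can be pictured as follows. This is a simplified illustration of the idea, not deepeval's actual implementation:

```python
import random

def anonymize_contestants(names, seed=None):
    """Illustrative sketch: shuffle contestant names and assign
    neutral labels so a judge cannot infer which model produced
    which output. Returns a label -> real-name mapping used to
    de-anonymize the winner afterwards."""
    rng = random.Random(seed)
    shuffled = list(names)
    rng.shuffle(shuffled)
    labels = [f"Contestant {chr(65 + i)}" for i in range(len(shuffled))]
    return dict(zip(labels, shuffled))
```

The judge only ever sees "Contestant A", "Contestant B", and so on, in a random order; the mapping is used afterwards to report which real contestant won.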