Arena Test Case
Quick Summary
An arena test case is a blueprint provided by deepeval for you to compare which iteration of your LLM app performed better. It works by comparing each contestant's LLMTestCase, and currently only supports the LLMTestCase for single-turn, text-based comparisons.
Support for ConversationalTestCase and MLLMTestCase is coming soon.
The ArenaTestCase currently only runs with the ArenaGEval metric, and all that is required is to provide a dictionary of contestant names to test cases:
from deepeval.test_case import ArenaTestCase, LLMTestCase

test_case = ArenaTestCase(
    contestants={
        "GPT-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris",
        ),
        "Claude-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris is the capital of France.",
        ),
        "Gemini 2.0": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Absolutely! The capital of France is Paris 😊",
        ),
        "Deepseek R1": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Hey there! It’s Paris—the beautiful City of Light. Have a wonderful day!",
        ),
    },
)
Note that all inputs and expected_outputs you provide across contestants MUST match.
For those wondering why we chose to duplicate the input in each LLMTestCase instead of moving it up to the ArenaTestCase class, it is because an LLMTestCase integrates nicely with the existing ecosystem.
You also shouldn't worry about unexpected errors, because deepeval will throw an error if the inputs or expected_outputs don't match.
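As a quick illustration of that check, the sketch below constructs an ArenaTestCase with intentionally mismatched inputs. The exact exception class raised is an implementation detail of deepeval, so the example simply catches a generic Exception:

from deepeval.test_case import ArenaTestCase, LLMTestCase

# The inputs below intentionally differ, so deepeval rejects the test case.
# (Catching a broad Exception because the exact exception class raised is an
# implementation detail and may vary across deepeval versions.)
try:
    ArenaTestCase(
        contestants={
            "GPT-4": LLMTestCase(
                input="What is the capital of France?",
                actual_output="Paris",
            ),
            "Claude-4": LLMTestCase(
                input="What is the capital of Germany?",  # mismatched input
                actual_output="Berlin",
            ),
        },
    )
except Exception as e:
    print(f"ArenaTestCase rejected mismatched inputs: {e}")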
Arena Test Case
The ArenaTestCase takes a simple contestants argument, which is a dictionary of contestant names to LLMTestCases:
contestants = {
    "GPT-4": LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris",
    ),
    "Claude-4": LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
    ),
    "Gemini 2.0": LLMTestCase(
        input="What is the capital of France?",
        actual_output="Absolutely! The capital of France is Paris 😊",
    ),
    "Deepseek R1": LLMTestCase(
        input="What is the capital of France?",
        actual_output="Hey there! It’s Paris—the beautiful City of Light. Have a wonderful day!",
    ),
}

test_case = ArenaTestCase(contestants=contestants)
The ArenaGEval metric is the only metric that uses an ArenaTestCase, and it picks a "winner" out of the list of contestants:
from deepeval.metrics import ArenaGEval
from deepeval.test_case import LLMTestCaseParams
...

arena_geval = ArenaGEval(
    name="Friendly",
    criteria="Choose the winner of the more friendly contestant based on the input and actual output",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)

arena_geval.measure(test_case)
print(arena_geval.winner, arena_geval.reason)
The ArenaTestCase streamlines the evaluation by automatically masking contestant names (to ensure unbiased judging) and randomizing their order.
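To make that idea concrete, the sketch below shows what masking and shuffling could look like conceptually. It is not deepeval's actual implementation, just an illustration of why the judge never sees real contestant names or a fixed ordering:

import random

# Illustrative only: anonymize contestant names and shuffle their order
# before the judge model sees them, then map the masked winner back.
def mask_and_shuffle(contestants: dict) -> tuple[dict, dict]:
    items = list(contestants.items())
    random.shuffle(items)  # randomize presentation order
    masked = {f"Contestant {i + 1}": tc for i, (_, tc) in enumerate(items)}
    alias_to_name = {f"Contestant {i + 1}": name for i, (name, _) in enumerate(items)}
    return masked, alias_to_name

masked, alias_to_name = mask_and_shuffle(contestants)
# After judging, the masked alias (e.g. "Contestant 2") is translated back
# to the real contestant name via alias_to_name.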