LLM Arena Evaluation Quickstart
Learn how to evaluate different versions of your LLM app in deepeval using LLM Arena-as-a-Judge, a comparison-based LLM eval.
Overview
Instead of scoring each LLM output individually with a single-output LLM-as-a-Judge method as seen in previous sections, you can also compare n test cases head-to-head to find the best version of your LLM app. Although this method does not produce numerical scores, it allows you to more reliably choose the "winning" LLM output for a given set of inputs and outputs.
In this 5 min quickstart, you'll learn how to:
- Set up an LLM arena
- Use Arena G-Eval to pick the best performing LLM app
Prerequisites
- Install deepeval
- A Confident AI API key (recommended). Sign up for one here
Confident AI allows you to view and share your testing reports. Set your API key in the CLI:
CONFIDENT_API_KEY="confident_us..."
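If you prefer to set the key from Python rather than the shell, a minimal sketch (this simply sets the same environment variable before deepeval runs):
import os

# Set the Confident AI API key for this process only; the value shown is a placeholder
os.environ["CONFIDENT_API_KEY"] = "confident_us..."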
Setup LLM Arena
In deepeval, arena test cases are used to compare different versions of your LLM app to see which one performs better. Each test case is an arena containing different contestants, where each contestant is a version of your LLM app evaluated based on its corresponding LLMTestCase.
deepeval provides a wide selection of LLM models that you can easily choose from and run evaluations with.
- OpenAI
- Anthropic
- Gemini
- Ollama
- Grok
- Azure OpenAI
- Amazon Bedrock
- Vertex AI
from deepeval.metrics import ArenaGEval
arena_geval = ArenaGEval(model="gpt-4.1")  # plus the name, criteria, and evaluation_params shown later in this guide
from deepeval.metrics import ArenaGEval
from deepeval.models import AnthropicModel
model = AnthropicModel("claude-3-7-sonnet-latest")
arena_geval = ArenaGEval(model=model)
from deepeval.metrics import ArenaGEval
from deepeval.models import GeminiModel
model = GeminiModel("gemini-2.5-flash")
arena_geval = ArenaGEval(model=model)
from deepeval.metrics import ArenaGEval
from deepeval.models import OllamaModel
model = OllamaModel("deepseek-r1")
arena_geval = ArenaGEval(model=model)
from deepeval.metrics import ArenaGEval
from deepeval.models import GrokModel
model = GrokModel("grok-4.1")
arena_geval = ArenaGEval(model=model)
from deepeval.metrics import ArenaGEval
from deepeval.models import AzureOpenAIModel
model = AzureOpenAIModel(
    model="gpt-4.1",
    deployment_name="Test Deployment",
    api_key="Your Azure OpenAI API Key",
    api_version="2025-01-01-preview",
    base_url="https://example-resource.azure.openai.com/",
    temperature=0
)
arena_geval = ArenaGEval(model=model)
from deepeval.metrics import ArenaGEval
from deepeval.models import AmazonBedrockModel
model = AmazonBedrockModel(
    model="anthropic.claude-3-opus-20240229-v1:0",
    region="us-east-1",
    generation_kwargs={"temperature": 0},
)
arena_geval = ArenaGEval(model=model)
from deepeval.metrics import ArenaGEval
from deepeval.models import GeminiModel
model = GeminiModel(
    model="gemini-1.5-pro",
    project="Your Project ID",
    location="us-central1",
    temperature=0
)
arena_geval = ArenaGEval(model=model)
Create an arena test case
Create an ArenaTestCase by passing a list of contestants.
from deepeval.test_case import ArenaTestCase, LLMTestCase, Contestant
contestant_1 = Contestant(
    name="Version 1",
    hyperparameters={"model": "gpt-3.5-turbo"},
    test_case=LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris",
    ),
)
contestant_2 = Contestant(
    name="Version 2",
    hyperparameters={"model": "gpt-4o"},
    test_case=LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
    ),
)
contestant_3 = Contestant(
    name="Version 3",
    hyperparameters={"model": "gpt-4.1"},
    test_case=LLMTestCase(
        input="What is the capital of France?",
        actual_output="Absolutely! The capital of France is Paris 😊",
    ),
)

test_case = ArenaTestCase(contestants=[contestant_1, contestant_2, contestant_3])
You can learn more about an ArenaTestCase here.
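In practice, each contestant's actual_output usually comes from running a different version of your LLM app on the same input. A minimal sketch of wiring this up, where generate_v1 and generate_v2 are hypothetical stand-ins for your own app versions:
from deepeval.test_case import ArenaTestCase, LLMTestCase, Contestant

# Hypothetical stand-ins for two versions of your LLM app
def generate_v1(user_input: str) -> str:
    return "Paris"  # e.g., output from your older prompt/model

def generate_v2(user_input: str) -> str:
    return "Paris is the capital of France."  # e.g., output from your newer prompt/model

def build_arena_case(user_input: str) -> ArenaTestCase:
    contestants = [
        Contestant(
            name="Version 1",
            hyperparameters={"model": "gpt-3.5-turbo"},
            test_case=LLMTestCase(input=user_input, actual_output=generate_v1(user_input)),
        ),
        Contestant(
            name="Version 2",
            hyperparameters={"model": "gpt-4o"},
            test_case=LLMTestCase(input=user_input, actual_output=generate_v2(user_input)),
        ),
    ]
    return ArenaTestCase(contestants=contestants)

test_case = build_arena_case("What is the capital of France?")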
Define arena metric
The ArenaGEval metric is the only metric that is compatible with ArenaTestCase. It picks a winner among the contestants based on the criteria defined.
from deepeval.metrics import ArenaGEval
from deepeval.test_case import LLMTestCaseParams
arena_geval = ArenaGEval(
    name="Friendly",
    criteria="Choose the contestant that is more friendly based on the input and actual output",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)
Run Your First Arena Evals
Now that you have created an arena with contestants and defined a metric, you can begin running arena evals to determine the winning contestant.
Run an evaluation
You can run arena evals by using the compare() function.
from deepeval.test_case import ArenaTestCase, LLMTestCase, LLMTestCaseParams
from deepeval.metrics import ArenaGEval
from deepeval import compare
test_case = ArenaTestCase(
    contestants=[...],  # Use the same contestants you've created before
)
arena_geval = ArenaGEval(...) # Use the same metric you've created before
compare(test_cases=[test_case], metric=arena_geval)
Log prompts and models
You can optionally log prompts and models for each contestant through the hyperparameters dictionary in the compare() function. This allows you to easily attribute winning contestants to their corresponding hyperparameters.
from deepeval.prompt import Prompt, PromptMessage

prompt_1 = Prompt(
    alias="First Prompt",
    messages_template=[PromptMessage(role="system", content="You are a helpful assistant.")]
)
prompt_2 = Prompt(
    alias="Second Prompt",
    messages_template=[PromptMessage(role="system", content="You are a helpful assistant.")]
)
compare(
    test_cases=[test_case],
    metric=arena_geval,
    hyperparameters={
        "Version 1": {"prompt": prompt_1, "model": "gpt-3.5-turbo"},
        "Version 2": {"prompt": prompt_2, "model": "gpt-4o"},
    },
)
You can now run this Python file to get your results:
python main.py
This should let you see the results of the arena as shown below:
Counter({'Version 3': 1})
🎉🥳 Congratulations! You have just run your first LLM arena-based evaluation. Here's what happened:
- When you call compare(), deepeval loops through each ArenaTestCase
- For each test case, deepeval uses the ArenaGEval metric to pick the "winner"
- To make the arena unbiased, deepeval masks the names of each contestant and randomizes their positions (see the sketch after this list)
- In the end, you get the number of "wins" each contestant got as the final output.
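To make the bias-mitigation step concrete, here is a minimal sketch of the general masking-and-shuffling idea, purely as an illustration of the concept and not deepeval's actual implementation:
import random

# Illustration only: hide contestant names and randomize order so the judge
# cannot favor a name or a position.
contestants = {
    "Version 1": "Paris",
    "Version 2": "Paris is the capital of France.",
    "Version 3": "Absolutely! The capital of France is Paris 😊",
}

items = list(contestants.items())
random.shuffle(items)  # randomize positions

# Neutral labels ("Contestant A", "Contestant B", ...) are all the judge sees
masked = {f"Contestant {chr(65 + i)}": output for i, (_, output) in enumerate(items)}
label_to_name = {f"Contestant {chr(65 + i)}": name for i, (name, _) in enumerate(items)}

# After the judge picks a winning label, map it back to the real contestant,
# e.g. label_to_name["Contestant B"] might resolve to "Version 3"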
Unlike single-output LLM-as-a-Judge metrics (i.e. every metric other than LLM arena evals), there is no concept of a "passing" test case for arena evals.
View on Confident AI (recommended)
If you've set your CONFIDENT_API_KEY, your arena comparisons will automatically appear as an experiment on Confident AI, the DeepEval platform.
Next Steps
deepeval lets you run Arena comparisons locally but isn't optimized for iterative prompt or model improvements. If you're looking for a more comprehensive and streamlined way to run Arena comparisons, Confident AI (DeepEval Cloud) lets you easily test different prompts, models, tools, and output configurations side by side, and evaluate them using any deepeval metric beyond ArenaGEval, all directly on the platform.
- Quick Comparisons: Compare model outputs directly using arena evaluations.
- Experiments: Create an experiment to run comprehensive comparisons on an evaluation dataset and set of metrics.
- Traced Comparisons: View detailed traces of LLM and tool calls during model comparisons.
- Metric Comparisons: Apply custom evaluation metrics to determine winning models in head-to-head comparisons.
- Log Prompts and Models: Track prompts and model configurations to understand which hyperparameters lead to better performance.
Now that you have run your first Arena evals, you should:
- Customize your metrics: You can change the criteria of your metric to be more specific to your use-case.
- Prepare a dataset: If you don't have one, generate one as a starting point to store your inputs as goldens (see the sketch below).
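For the dataset step, a minimal sketch of storing inputs as goldens with deepeval's EvaluationDataset (the inputs shown are placeholders for your own):
from deepeval.dataset import EvaluationDataset, Golden

# Placeholder inputs; replace with the prompts your users actually send
goldens = [
    Golden(input="What is the capital of France?"),
    Golden(input="Summarize the following article in two sentences."),
]

dataset = EvaluationDataset(goldens=goldens)
# At evaluation time, build one ArenaTestCase per golden by running each
# version of your LLM app on golden.input and comparing the outputs.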
The arena metric is only used for picking winners among the contestants; it is not used for evaluating the answers themselves. To evaluate your LLM application on specific use cases, you can read the other quickstarts here.