LLM Arena Evaluation
Learn how to evaluate different versions of your LLM app in `deepeval` using LLM Arena-as-a-Judge, a comparison-based LLM eval.
Overview
Instead of scoring LLM outputs one at a time with a single-output LLM-as-a-Judge method, as seen in previous sections, you can also compare n contestants pairwise within each test case to find the best version of your LLM app. Although this method does not produce numerical scores, it lets you more reliably choose the "winning" LLM output for a given set of inputs and outputs.
In this 5 min quickstart, you'll learn how to:
- Set up an LLM arena
- Use Arena G-Eval to pick the best-performing LLM app
Prerequisites
- Install `deepeval` (`pip install deepeval`)
Setup LLM Arena
In `deepeval`, arena test cases are used to compare different versions of your LLM app and see which one performs better. Each test case is an arena whose contestants are different versions of your LLM app, each evaluated based on its corresponding `LLMTestCase`.
Create an arena test case
Create an `ArenaTestCase` by passing a dictionary of contestants, with version names as keys and their corresponding `LLMTestCase`s as values.
```python
from deepeval.test_case import ArenaTestCase, LLMTestCase

test_case = ArenaTestCase(
    contestants={
        "Version 1": LLMTestCase(
            input='Who wrote the novel "1984"?',
            actual_output="George Orwell",
        ),
        "Version 2": LLMTestCase(
            input='Who wrote the novel "1984"?',
            actual_output='"1984" was written by George Orwell.',
        ),
        "Version 3": LLMTestCase(
            input='Who wrote the novel "1984"?',
            actual_output="That dystopian masterpiece was penned by George Orwell 📚",
        ),
        "Version 4": LLMTestCase(
            input='Who wrote the novel "1984"?',
            actual_output="George Orwell is the brilliant mind behind the novel '1984'.",
        ),
    },
)
```
You can learn more about `LLMTestCase` here.
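Beyond `input` and `actual_output`, an `LLMTestCase` can carry additional fields that other metrics may use. The sketch below is illustrative; the optional fields shown (`expected_output`, `retrieval_context`) are assumptions based on common `deepeval` usage and may vary by version:

```python
from deepeval.test_case import LLMTestCase

# Illustrative sketch only: optional fields may vary by deepeval version.
test_case = LLMTestCase(
    input='Who wrote the novel "1984"?',
    actual_output="George Orwell",
    expected_output="George Orwell",  # optional reference answer
    retrieval_context=['"1984" is a dystopian novel by George Orwell.'],  # optional, for RAG apps
)
```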
Define arena metric
The `ArenaGEval` metric is the only metric compatible with `ArenaTestCase`. It picks a winner among the contestants based on the criteria you define.
```python
from deepeval.metrics import ArenaGEval
from deepeval.test_case import LLMTestCaseParams

arena_geval = ArenaGEval(
    name="Friendly",
    criteria="Choose the winner of the more friendly contestant based on the input and actual output",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)
```
Run Your First Arena Evals
Now that you have created an arena with contestants and defined a metric, you can run evals using the `compare()` method:
```python
from deepeval.test_case import ArenaTestCase, LLMTestCase
from deepeval.metrics import ArenaGEval
from deepeval import compare

test_case = ArenaTestCase(
    contestants={...},  # Use the same contestants you've created before
)
arena_geval = ArenaGEval(...)  # Use the same metric you've created before

compare(test_cases=[test_case], metric=arena_geval)
```
You can now run this Python file to get your results:

```bash
python main.py
```

You should see the results of the arena, similar to the output below:

```
Counter({'Version 3': 1})
```
🎉🥳 Congratulations! You have just run your first LLM arena-based evaluation. Here's what happened:

- When you call `compare()`, `deepeval` loops through each `ArenaTestCase`
- For each test case, `deepeval` uses the `ArenaGEval` metric to pick the "winner"
- To make the arena unbiased, `deepeval` masks the names of each contestant and randomizes their positions
- In the end, you get the number of "wins" each contestant received as the final output (see the sketch below)
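The final tally is simply a count of wins per contestant. The standalone sketch below reproduces that counting with Python's `collections.Counter`; the winner names are hypothetical, since `deepeval` computes and prints this tally for you:

```python
from collections import Counter

# Hypothetical winners picked by the judge across three separate arena test cases
# (deepeval produces this tally for you when you call compare()).
winners = ["Version 3", "Version 3", "Version 2"]

print(Counter(winners))  # Counter({'Version 3': 2, 'Version 2': 1})
```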
Unlike single-output LLM-as-a-Judge metrics (i.e. everything other than LLM arena evals), there is no concept of a "passing" test case for arena evals.
Next Steps
Now that you have run your first Arena evals, you should:
- Customize your metrics: You can change the criteria of your metric to be more specific to your use case, as shown in the sketch after this list.
- Prepare a dataset: If you don't have one, generate one as a starting point to store your inputs as goldens.
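As an example of a customized criteria, here is a hedged sketch of an `ArenaGEval` metric that prefers accurate, concise answers; the criteria wording is illustrative, not prescriptive:

```python
from deepeval.metrics import ArenaGEval
from deepeval.test_case import LLMTestCaseParams

# Illustrative only: swap the criteria for whatever matters in your use case.
arena_geval = ArenaGEval(
    name="Correct and Concise",
    criteria="Choose the contestant whose actual output answers the input most accurately and concisely",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)
```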
The arena metric is only used for picking a winner among the contestants; it's not used for evaluating the answers themselves. To evaluate your LLM application on specific use cases, you can read the other quickstarts here: