Arena G-Eval
The arena G-Eval is an adapted version of deepeval's popular GEval metric, but for choosing which LLMTestCase performed better instead. It is currently the best way to compare different iterations of your LLM app.
Required Arguments
To use the ArenaGEval metric, you'll have to provide the following arguments when creating an ArenaTestCase:
contestants
You'll also need to supply any additional arguments such as expected_output and context within the LLMTestCases of contestants if your evaluation criteria depend on these parameters, as in the sketch below.
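For example, here is a sketch of an ArenaTestCase whose contestant LLMTestCases also carry expected_output; the contestant names and outputs are illustrative, not required values:

from deepeval.test_case import ArenaTestCase, LLMTestCase

a_test_case = ArenaTestCase(
    contestants={
        "Model A": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris.",
            expected_output="Paris is the capital of France.",
        ),
        "Model B": LLMTestCase(
            input="What is the capital of France?",
            actual_output="It's Paris!",
            expected_output="Paris is the capital of France.",
        ),
    },
)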
Usage
To create a custom metric that chooses the best LLMTestCase, simply instantiate an ArenaGEval class and define your evaluation criteria in everyday language:
from deepeval import evaluate
from deepeval.test_case import ArenaTestCase, LLMTestCase, LLMTestCaseParams
from deepeval.metrics import ArenaGEval

# Two contestants answering the same input
a_test_case = ArenaTestCase(
    contestants={
        "GPT-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris",
        ),
        "Claude-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris is the capital of France.",
        ),
    },
)

# Criteria written in everyday language
arena_geval = ArenaGEval(
    name="Friendly",
    criteria="Choose the winner of the more friendly contestant based on the input and actual output",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)

arena_geval.measure(a_test_case)
print(arena_geval.winner, arena_geval.reason)
There are THREE mandatory and FOUR optional parameters required when instantiating an ArenaGEval class:
- name: name of metric. This will not affect the evaluation.
- criteria: a description outlining the specific evaluation aspects for each test case.
- evaluation_params: a list of type LLMTestCaseParams; include only the parameters that are relevant for evaluation.
- [Optional] evaluation_steps: a list of strings outlining the exact steps the LLM should take for evaluation (see the sketch after this list). If evaluation_steps is not provided, ArenaGEval will generate a series of evaluation_steps on your behalf based on the provided criteria. You can only provide either evaluation_steps OR criteria, and not both.
- [Optional] model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to 'gpt-4o'.
- [Optional] async_mode: a boolean which when set to True, enables concurrent execution within the measure() method. Defaulted to True.
- [Optional] verbose_mode: a boolean which when set to True, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated? section. Defaulted to False.
For accurate and valid results, only evaluation parameters that are mentioned in criteria/evaluation_steps should be included as members of evaluation_params.
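As a hedged illustration of these options, the snippet below instantiates ArenaGEval with explicit evaluation_steps instead of criteria, plus the optional model, async_mode, and verbose_mode arguments; the step wording and configuration values are only examples, not values deepeval requires:

from deepeval.metrics import ArenaGEval
from deepeval.test_case import LLMTestCaseParams

# Illustrative configuration only: the steps below are examples, not
# defaults prescribed by deepeval.
arena_geval = ArenaGEval(
    name="Helpfulness",
    evaluation_steps=[
        "Compare how directly each actual output answers the input.",
        "Penalize contestants whose actual output is curt or dismissive.",
        "Pick the contestant that is the most helpful overall.",
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
    model="gpt-4o",      # any OpenAI model name, or a DeepEvalBaseLLM instance
    async_mode=True,     # run measure() concurrently (the default)
    verbose_mode=True,   # print intermediate steps to the console
)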
How Is It Calculated?
The ArenaGEval is an adapted version of GEval, so like GEval, the ArenaGEval metric is a two-step algorithm: it first generates a series of evaluation_steps using chain-of-thought (CoT) prompting based on the given criteria, before using the generated evaluation_steps to determine the winner based on the evaluation_params presented in each LLMTestCase.
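To make the two-step flow concrete, here is a rough conceptual sketch; the helper functions are stubs standing in for the judge-LLM calls deepeval makes internally and are not part of its public API:

# Conceptual sketch only -- not deepeval's internal implementation.
def generate_evaluation_steps(criteria: str) -> list[str]:
    # Step 1: in deepeval this is a chain-of-thought LLM call that expands
    # the criteria into concrete evaluation steps; here it is a stub.
    return [f"Step derived from criteria: {criteria}"]

def pick_winner(steps, contestants, evaluation_params):
    # Step 2: in deepeval the judge LLM weighs each contestant's
    # evaluation_params against the steps; here we just pick the first
    # contestant as a placeholder.
    name = next(iter(contestants))
    return name, f"Chosen after applying {len(steps)} step(s)."

def arena_g_eval(criteria, contestants, evaluation_params):
    steps = generate_evaluation_steps(criteria)
    return pick_winner(steps, contestants, evaluation_params)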