Arena G-Eval

The Arena G-Eval is an adapted version of deepeval's popular GEval metric, but for choosing which LLMTestCase performed better instead of scoring a single one.

It is currently the best way to compare different iterations of your LLM app.

Required Arguments

To use the ArenaGEval metric, you'll have to provide the following arguments when creating an ArenaTestCase:

  • contestants

You'll also need to supply any additional arguments, such as expected_output and context, within each contestant's LLMTestCase if your evaluation criteria depends on those parameters.
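For example, if your criteria references the expected output or retrieval context, each contestant's LLMTestCase should carry those fields. Here is a minimal sketch (the contestant names and values are illustrative); in this case evaluation_params would also include LLMTestCaseParams.EXPECTED_OUTPUT and LLMTestCaseParams.CONTEXT:

from deepeval.test_case import ArenaTestCase, LLMTestCase

a_test_case = ArenaTestCase(
    contestants={
        "Model A": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris",
            expected_output="Paris is the capital of France.",
            context=["France's capital city is Paris."],
        ),
        "Model B": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Lyon",
            expected_output="Paris is the capital of France.",
            context=["France's capital city is Paris."],
        ),
    },
)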

Usage

To create a custom metric that chooses the best LLMTestCase, simply instantiate an ArenaGEval class and define an evaluation criteria in everyday language:

from deepeval import evaluate
from deepeval.test_case import ArenaTestCase, LLMTestCase, LLMTestCaseParams
from deepeval.metrics import ArenaGEval

a_test_case = ArenaTestCase(
    contestants={
        "GPT-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris",
        ),
        "Claude-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris is the capital of France.",
        ),
    },
)
arena_geval = ArenaGEval(
    name="Friendly",
    criteria="Choose the winner based on which contestant's actual output is more friendly given the input",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)

arena_geval.measure(a_test_case)
print(arena_geval.winner, arena_geval.reason)

There are THREE mandatory and FOUR optional parameters when instantiating an ArenaGEval class:

  • name: name of metric. This will not affect the evaluation.
  • criteria: a description outlining the specific evaluation aspects for each test case.
  • evaluation_params: a list of type LLMTestCaseParams; include only the parameters that are relevant for evaluation.
  • [Optional] evaluation_steps: a list of strings outlining the exact steps the LLM should take for evaluation. If evaluation_steps is not provided, ArenaGEval will generate a series of evaluation_steps on your behalf based on the provided criteria. You can only provide either evaluation_steps OR criteria, and not both (see the sketch after the danger note below).
  • [Optional] model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to 'gpt-4o'.
  • [Optional] async_mode: a boolean which when set to True, enables concurrent execution within the measure() method. Defaulted to True.
  • [Optional] verbose_mode: a boolean which when set to True, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to False.
danger

For accurate and valid results, only evaluation parameters that are mentioned in criteria/evaluation_steps should be included as a member of evaluation_params.
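As a sketch of the optional parameters, here is ArenaGEval instantiated with explicit evaluation_steps instead of criteria (the steps themselves are illustrative, not prescribed by deepeval):

from deepeval.metrics import ArenaGEval
from deepeval.test_case import LLMTestCaseParams

arena_geval = ArenaGEval(
    name="Friendly",
    # evaluation_steps replaces criteria; provide one or the other, not both
    evaluation_steps=[
        "Read the input to understand what the user asked for.",
        "Compare each contestant's actual output for warmth and politeness.",
        "Pick the contestant whose actual output is the most friendly.",
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
    model="gpt-4o",      # any OpenAI model name, or a custom DeepEvalBaseLLM
    async_mode=True,     # run evaluation concurrently inside measure()
    verbose_mode=True,   # print intermediate steps to the console
)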

How Is It Calculated?

The ArenaGEval is an adapted version of GEval, so like GEval, the ArenaGEval metric is a two-step algorithm that first generates a series of evaluation_steps using chain of thought (CoT) based on the given criteria, before using the generated evaluation_steps to determine the winner based on the evaluation_params presented in each LLMTestCase.
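Conceptually, the two steps look something like the following pseudocode (the llm and build_comparison_prompt helpers are hypothetical placeholders, not deepeval internals):

# Step 1: turn the criteria into concrete evaluation steps via a CoT prompt
evaluation_steps = llm.generate(
    f"Given the criteria '{criteria}', list the steps for judging which contestant wins."
)

# Step 2: apply those steps to the evaluation_params of every contestant
winner, reason = llm.generate(
    build_comparison_prompt(evaluation_steps, contestants, evaluation_params)
)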