Conversational G-Eval
The conversational G-Eval is an adopted version of deepeval's popular GEval metric but for evaluating entire conversations instead.
It is currently the best way to define custom criteria to evaluate multi-turn conversations in deepeval. By defining a custom ConversationalGEval, you can easily determine whether your LLM chatbot is able to consistently generate responses that are up to standard with your custom criteria throughout a conversation.
Required Arguments
To use the ConversationalGEval metric, you'll have to provide the following arguments when creating a ConversationalTestCase:
turns
You'll also want to supply any additional arguments such as retrieval_context and tools_called in turns if your evaluation criteria depends on these parameters.
Usage
To create a custom metric that evaluates entire LLM conversations, simply instantiate a ConversationalGEval class and define an evaluation criteria in everyday language:
from deepeval import evaluate
from deepeval.test_case import Turn, MultiTurnParams, ConversationalTestCase
from deepeval.metrics import ConversationalGEval
convo_test_case = ConversationalTestCase(
turns=[Turn(role="...", content="..."), Turn(role="...", content="...")]
)
metric = ConversationalGEval(
name="Professionalism",
criteria="Determine whether the assistant has acted professionally based on the content."
)
# To run metric as a standalone
# metric.measure(convo_test_case)
# print(metric.score, metric.reason)
evaluate(test_cases=[convo_test_case], metrics=[metric])There are THREE mandatory and SIX optional parameters required when instantiating an ConversationalGEval class:
name: name of metric. This will not affect the evaluation.criteria: a description outlining the specific evaluation aspects for each test case.- [Optional]
evaluation_params: a list of typeMultiTurnParams, include only the parameters that are relevant for evaluation. Defaulted to[MultiTurnParams.CONTENT]. - [Optional]
evaluation_steps: a list of strings outlining the exact steps the LLM should take for evaluation. Ifevaluation_stepsis not provided,ConversationalGEvalwill generate a series ofevaluation_stepson your behalf based on the providedcriteria. You can only provide eitherevaluation_stepsORcriteria, and not both. - [Optional]
threshold: the passing threshold, defaulted to 0.5. - [Optional]
model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of typeDeepEvalBaseLLM. Defaulted togpt-5.4. - [Optional]
strict_mode: a boolean which when set toTrue, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted toFalse. - [Optional]
async_mode: a boolean which when set toTrue, enables concurrent execution within themeasure()method. Defaulted toTrue. - [Optional]
verbose_mode: a boolean which when set toTrue, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted toFalse. - [Optional]
evaluation_template: a class of typeConversationalGEvalTemplate, which allows you to override the default prompts used to compute theConversationalGEvalscore. Defaulted todeepeval'sConversationalGEvalTemplate.
As a standalone
You can also run the ConversationalGEval on a single test case as a standalone, one-off execution.
...
metric.measure(convo_test_case)
print(metric.score, metric.reason)How Is It Calculated?
The ConversationalGEval is an adapted version of GEval, so alike GEval, the ConversationalGEval metric is a two-step algorithm that first generates a series of evaluation_steps using chain of thoughts (CoTs) based on the given criteria, before using the generated evaluation_steps to determine the final score using the evaluation_params presented in each turn.
Unlike regular GEval though, the ConversationalGEval takes the entire conversation history into account during evaluation.
Customize Your Template
Since deepeval's ConversationalGEval is evaluated by LLM-as-a-judge, you can likely improve your metric accuracy by overriding deepeval's default prompt templates. This is especially helpful if:
- You're using a custom evaluation LLM, especially for smaller models that have weaker instruction following capabilities.
- You want to customize the examples used in the default
ConversationalGEvalTemplateto better align with your expectations.
Here's a quick example of how you can override the process of extracting claims in the ConversationalGEval algorithm:
from deepeval.metrics import ConversationalGEval
from deepeval.metrics.conversational_g_eval import ConversationalGEvalTemplate
import textwrap
class CustomConvoGEvalTemplate(ConversationalGEvalTemplate):
@staticmethod
def generate_evaluation_steps(parameters: str, criteria: str):
return textwrap.dedent(
f"""
You are given criteria for evaluating a conversation based on the following parameters: {parameters}.
Write 3-4 clear and concise evaluation steps that describe how to judge the quality of each turn and the conversation overall.
Criteria:
{criteria}
Return JSON only in the format:
{{
"steps": [
"Step 1",
"Step 2",
"Step 3"
]
}}
JSON:
"""
)
# Inject custom template to metric
metric = ConversationalGEval(evaluation_template=CustomConvoGEvalTemplate)
metric.measure(...)