🔥 DeepEval 4.0 just got released. Read the announcement.
Custom

Conversational G-Eval

LLM-as-a-judge
Custom
Multi-turn
Chatbot
Multimodal

The conversational G-Eval is an adopted version of deepeval's popular GEval metric but for evaluating entire conversations instead.

It is currently the best way to define custom criteria to evaluate multi-turn conversations in deepeval. By defining a custom ConversationalGEval, you can easily determine whether your LLM chatbot is able to consistently generate responses that are up to standard with your custom criteria throughout a conversation.

Required Arguments

To use the ConversationalGEval metric, you'll have to provide the following arguments when creating a ConversationalTestCase:

  • turns

You'll also want to supply any additional arguments such as retrieval_context and tools_called in turns if your evaluation criteria depends on these parameters.

Usage

To create a custom metric that evaluates entire LLM conversations, simply instantiate a ConversationalGEval class and define an evaluation criteria in everyday language:

from deepeval import evaluate
from deepeval.test_case import Turn, MultiTurnParams, ConversationalTestCase
from deepeval.metrics import ConversationalGEval

convo_test_case = ConversationalTestCase(
    turns=[Turn(role="...", content="..."), Turn(role="...", content="...")]
)
metric = ConversationalGEval(
    name="Professionalism",
    criteria="Determine whether the assistant has acted professionally based on the content."
)

# To run metric as a standalone
# metric.measure(convo_test_case)
# print(metric.score, metric.reason)

evaluate(test_cases=[convo_test_case], metrics=[metric])

There are THREE mandatory and SIX optional parameters required when instantiating an ConversationalGEval class:

  • name: name of metric. This will not affect the evaluation.
  • criteria: a description outlining the specific evaluation aspects for each test case.
  • [Optional] evaluation_params: a list of type MultiTurnParams, include only the parameters that are relevant for evaluation. Defaulted to [MultiTurnParams.CONTENT].
  • [Optional] evaluation_steps: a list of strings outlining the exact steps the LLM should take for evaluation. If evaluation_steps is not provided, ConversationalGEval will generate a series of evaluation_steps on your behalf based on the provided criteria. You can only provide either evaluation_steps OR criteria, and not both.
  • [Optional] threshold: the passing threshold, defaulted to 0.5.
  • [Optional] model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to gpt-5.4.
  • [Optional] strict_mode: a boolean which when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.
  • [Optional] async_mode: a boolean which when set to True, enables concurrent execution within the measure() method. Defaulted to True.
  • [Optional] verbose_mode: a boolean which when set to True, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to False.
  • [Optional] evaluation_template: a class of type ConversationalGEvalTemplate, which allows you to override the default prompts used to compute the ConversationalGEval score. Defaulted to deepeval's ConversationalGEvalTemplate.

As a standalone

You can also run the ConversationalGEval on a single test case as a standalone, one-off execution.

...

metric.measure(convo_test_case)
print(metric.score, metric.reason)

How Is It Calculated?

The ConversationalGEval is an adapted version of GEval, so alike GEval, the ConversationalGEval metric is a two-step algorithm that first generates a series of evaluation_steps using chain of thoughts (CoTs) based on the given criteria, before using the generated evaluation_steps to determine the final score using the evaluation_params presented in each turn.

Unlike regular GEval though, the ConversationalGEval takes the entire conversation history into account during evaluation.

Customize Your Template

Since deepeval's ConversationalGEval is evaluated by LLM-as-a-judge, you can likely improve your metric accuracy by overriding deepeval's default prompt templates. This is especially helpful if:

  • You're using a custom evaluation LLM, especially for smaller models that have weaker instruction following capabilities.
  • You want to customize the examples used in the default ConversationalGEvalTemplate to better align with your expectations.

Here's a quick example of how you can override the process of extracting claims in the ConversationalGEval algorithm:

from deepeval.metrics import ConversationalGEval
from deepeval.metrics.conversational_g_eval import ConversationalGEvalTemplate
import textwrap


class CustomConvoGEvalTemplate(ConversationalGEvalTemplate):
    @staticmethod
    def generate_evaluation_steps(parameters: str, criteria: str):
        return textwrap.dedent(
            f"""
            You are given criteria for evaluating a conversation based on the following parameters: {parameters}.
            Write 3-4 clear and concise evaluation steps that describe how to judge the quality of each turn and the conversation overall.

            Criteria:
            {criteria}

            Return JSON only in the format:
            {{
                "steps": [
                    "Step 1",
                    "Step 2",
                    "Step 3"
                ]
            }}

            JSON:
            """
        )

# Inject custom template to metric
metric = ConversationalGEval(evaluation_template=CustomConvoGEvalTemplate)
metric.measure(...)

FAQs

On this page