DAG (Deep Acyclic Graph)

LLM-as-a-judge

Custom

Single-turn

Multimodal

The deep acyclic graph (DAG) metric in deepeval is currently the most versatile custom metric: it lets you build deterministic decision trees for evaluation, with LLM-as-a-judge making the judgement at each step.

The DAGMetric gives you more deterministic control than GEval. You can, however, still use GEval, or any other default metric in deepeval, within your DAGMetric.

Should I use DAG or G-Eval?

Suppose you want to check that a summary contains the right headings in the right order. If you were to do this using GEval, your evaluation_steps might look something like this:

The summary is completely wrong if it misses any of the headings: "intro", "body", "conclusion".
If the summary has all the headings but they are in the wrong order, penalize it.
If the summary has all the correct headings and they are in the right order, give it a perfect score.

Which in turn looks something like this in code:

from deepeval.test_case import SingleTurnParams
from deepeval.metrics import GEval

metric = GEval(
    name="Format Correctness",
    evaluation_steps=[
        "The `actual_output` is completely wrong if it misses any of the headings: 'intro', 'body', 'conclusion'.",
        "If the `actual_output` has all the headings but they are in the wrong order, penalize it.",
        "If the `actual_output` has all the correct headings and they are in the right order, give it a perfect score."
    ],
    evaluation_params=[SingleTurnParams.ACTUAL_OUTPUT]
)

However, this will NOT give you the exact score according to your criteria, and is NOT as deterministic as you might think. Instead, you can build a DAGMetric that gives deterministic scores based on the logic you've decided for your evaluation criteria.

You can still use GEval in the DAGMetric, but the DAGMetric will give you much greater control.

Required Arguments

To use the DAGMetric, you'll have to provide the following arguments when creating an LLMTestCase:

input
actual_output

You'll also need to supply any additional arguments, such as expected_output and tools_called, if your evaluation criteria depend on these parameters.

Usage

The DAGMetric can be used to evaluate single-turn LLM interactions based on LLM-as-a-judge decision-trees.

from deepeval.metrics.dag import DeepAcyclicGraph
from deepeval.metrics import DAGMetric

dag = DeepAcyclicGraph(root_nodes=[...])

metric = DAGMetric(name="Instruction Following", dag=dag)

There are TWO mandatory and SIX optional parameters when creating a DAGMetric:

name: name of the metric.
dag: a DeepAcyclicGraph which represents your evaluation decision tree. Here's how to create one.
[Optional] threshold: a float representing the minimum passing threshold. Defaulted to 0.5.
[Optional] model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to gpt-5.4.
[Optional] include_reason: a boolean which when set to True, will include a reason for its evaluation score. Defaulted to True.
[Optional] strict_mode: a boolean which when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.
[Optional] async_mode: a boolean which when set to True, enables concurrent execution within the measure() method. Defaulted to True.
[Optional] verbose_mode: a boolean which when set to True, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to False.
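
The interaction between threshold and strict_mode can be sketched in plain Python. This only illustrates the parameter descriptions above and is not deepeval's implementation: the function name resolve_outcome is hypothetical, and the sketch assumes the metric score lies between 0 and 1 and that a score equal to the threshold counts as passing.

```python
from typing import Tuple


def resolve_outcome(
    score: float, threshold: float = 0.5, strict_mode: bool = False
) -> Tuple[float, bool]:
    """Return (final_score, passed) following the documented parameter semantics."""
    if strict_mode:
        # Strict mode: the score becomes binary and the threshold is overridden to 1.
        score = 1.0 if score == 1.0 else 0.0
        threshold = 1.0
    # Assumption: a score equal to the threshold passes.
    return score, score >= threshold


print(resolve_outcome(0.4))                    # below the default threshold
print(resolve_outcome(0.8))                    # passes with the default threshold
print(resolve_outcome(0.8, strict_mode=True))  # not perfect, so the score collapses to 0
```

In strict mode, only a perfect score survives, which is why any user-supplied threshold is ignored.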

Complete Walkthrough

In this walkthrough, we'll write a custom DAGMetric to see whether our LLM application has summarized meeting transcripts in the correct format. Let's say these are our criteria, in plain English:

The summary of meeting transcripts should contain the "intro", "body", and "conclusion" headings.
The summary of meeting transcripts should present the "intro", "body", and "conclusion" headings in the correct order.

Here's the example LLMTestCase representing the transcript to be evaluated for formatting correctness:

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="""
Alice: "Today's agenda: product update, blockers, and marketing timeline. Bob, updates?"
Bob: "Core features are done, but we're optimizing performance for large datasets. Fixes by Friday, testing next week."
Alice: "Charlie, does this timeline work for marketing?"
Charlie: "We need finalized messaging by Monday."
Alice: "Bob, can we provide a stable version by then?"
Bob: "Yes, we'll share an early build."
Charlie: "Great, we'll start preparing assets."
Alice: "Plan: fixes by Friday, marketing prep Monday, sync next Wednesday. Thanks, everyone!"
""",
    actual_output="""
Intro:
Alice outlined the agenda: product updates, blockers, and marketing alignment.

Body:
Bob reported performance issues being optimized, with fixes expected by Friday. Charlie requested finalized messaging by Monday for marketing preparation. Bob confirmed an early stable build would be ready.

Conclusion:
The team aligned on next steps: engineering finalizing fixes, marketing preparing content, and a follow-up sync scheduled for Wednesday.
"""
)

Build Your Decision Tree

The DAGMetric requires you to first construct a decision tree that has directed edges and is acyclic. Let's take this decision tree for example:

We can see that the actual_output of an LLMTestCase is first processed to extract all headings, before deciding whether all the required headings are present. If they are not, we give it a score of 0, heavily penalizing it, whereas if they are, we check the degree to which they are in the correct order. Based on this "degree of correct ordering", we can then decide what score to assign it.

We can see that our decision tree involves four types of nodes:

TaskNodes: this node simply processes an LLMTestCase into the desired format for subsequent judgement.
BinaryJudgementNodes: this node will take in a criteria, and output a verdict of True/False based on whether that criteria has been met.
NonBinaryJudgementNodes: this node will also take in a criteria, but unlike the BinaryJudgementNode, the NonBinaryJudgementNode can output a verdict other than True/False.
VerdictNodes: a VerdictNode represents one possible verdict of its parent judgement node. Every leaf node is a VerdictNode, and a leaf VerdictNode determines the final output score based on the evaluation path that was taken.

Putting everything into context, the TaskNode is the node that extracts summary headings from the actual_output, the BinaryJudgementNode is the node that determines if all headings are present, and the NonBinaryJudgementNode determines if they are in the correct order. The final score is determined by the four leaf VerdictNodes that carry a score.

Implement DAG In Code

Here's what this decision tree would look like in code:

from deepeval.test_case import SingleTurnParams
from deepeval.metrics.dag import (
    DeepAcyclicGraph,
    TaskNode,
    BinaryJudgementNode,
    NonBinaryJudgementNode,
    VerdictNode,
)

correct_order_node = NonBinaryJudgementNode(
    criteria="Are the summary headings in the correct order: 'intro' => 'body' => 'conclusion'?",
    children=[
        VerdictNode(verdict="Yes", score=10),
        VerdictNode(verdict="Two are out of order", score=4),
        VerdictNode(verdict="All out of order", score=2),
    ],
)

correct_headings_node = BinaryJudgementNode(
    criteria="Do the summary headings contain all three: 'intro', 'body', and 'conclusion'?",
    children=[
        VerdictNode(verdict=False, score=0),
        VerdictNode(verdict=True, child=correct_order_node),
    ],
)

extract_headings_node = TaskNode(
    instructions="Extract all headings in `actual_output`",
    evaluation_params=[SingleTurnParams.ACTUAL_OUTPUT],
    output_label="Summary headings",
    children=[correct_headings_node, correct_order_node],
)

# create the DAG
dag = DeepAcyclicGraph(root_nodes=[extract_headings_node])

When creating your DAG, there are three important points to remember:

There should only be an edge to a parent node if the current node depends on the output of the parent node.
All nodes, except for VerdictNodes, can have access to an LLMTestCase at any point in time.
All leaf nodes are VerdictNodes, but not all VerdictNodes are leaf nodes.

IMPORTANT: You'll see that in our example, extract_headings_node has correct_order_node as a child because correct_order_node's criteria depends on the extracted summary headings from the actual_output of the LLMTestCase.
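
To make the control flow concrete, the same branching can be mirrored in plain Python, with simple rule-based functions standing in for the LLM judgements. This is a sketch of the logic only, not how deepeval executes the DAG: every function name here is hypothetical, the regular expression assumes headings sit alone on a line and end with a colon, and the scores are the ones assigned to the VerdictNodes above.

```python
import re
from typing import List

EXPECTED = ["intro", "body", "conclusion"]
ORDER_SCORES = {"Yes": 10, "Two are out of order": 4, "All out of order": 2}


def extract_headings(text: str) -> List[str]:
    """TaskNode stand-in: treat lines of the form 'Word:' as headings."""
    pattern = r"^[ \t]*([A-Za-z]+):[ \t]*$"
    return [h.lower() for h in re.findall(pattern, text, flags=re.MULTILINE)]


def has_all_headings(headings: List[str]) -> bool:
    """BinaryJudgementNode stand-in: are all three headings present?"""
    return set(EXPECTED) <= set(headings)


def order_verdict(headings: List[str]) -> str:
    """NonBinaryJudgementNode stand-in; only called once all headings are present."""
    found = list(dict.fromkeys(h for h in headings if h in EXPECTED))
    misplaced = sum(a != b for a, b in zip(found, EXPECTED))
    if misplaced == 0:
        return "Yes"
    # A permutation of three items has either 0, 2, or 3 items out of place.
    return "Two are out of order" if misplaced == 2 else "All out of order"


def format_score(actual_output: str) -> int:
    """Walk the tree: extract, check presence, then check order."""
    headings = extract_headings(actual_output)
    if not has_all_headings(headings):
        return 0  # VerdictNode(verdict=False, score=0)
    return ORDER_SCORES[order_verdict(headings)]


sample = "Intro:\nAgenda.\n\nBody:\nUpdates.\n\nConclusion:\nNext steps.\n"
print(format_score(sample))  # 10
```

Each path through the tree ends in exactly one fixed score, which is what makes the result deterministic once the judgements are made. In the real metric, the judgements themselves still come from an LLM.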

Create Your `DAGMetric`

Now that you have your DAG, all that's left to do is to simply supply it when creating a DAGMetric:

from deepeval.metrics import DAGMetric

...
format_correctness = DAGMetric(name="Format Correctness", dag=dag)
format_correctness.measure(test_case)
print(format_correctness.score)

Single-Turn Nodes

There are four node types that make up your deep acyclic graph. You'll be using these four node types to define a DAG, as follows:

from deepeval.metrics.dag import DeepAcyclicGraph

dag = DeepAcyclicGraph(root_nodes=...)

Here, root_nodes is a list of type TaskNode, BinaryJudgementNode, or NonBinaryJudgementNode. Let's go through all of them in more detail.

`TaskNode`

The TaskNode is designed specifically for processing data such as parameters from LLMTestCases, or even an output from a parent TaskNode. This allows for the breakdown of text into more atomic units that are better for evaluation.

from typing import Optional, List
from deepeval.metrics.dag import BaseNode
from deepeval.test_case import SingleTurnParams

class TaskNode(BaseNode):
    instructions: str
    output_label: str
    children: List[BaseNode]
    evaluation_params: Optional[List[SingleTurnParams]] = None
    label: Optional[str] = None

There are THREE mandatory and TWO optional parameters when creating a TaskNode:

instructions: a string specifying how to process parameters of an LLMTestCase, and/or outputs from a previous parent TaskNode.
output_label: a string representing the final output. The children BaseNodes will use the output_label to reference the output from the current TaskNode.
children: a list of BaseNodes. There must not be a VerdictNode in the list of children.
[Optional] evaluation_params: a list of type SingleTurnParams. Include only the parameters that are relevant for processing.
[Optional] label: a string that will be displayed in the verbose logs if verbose_mode is True.

`BinaryJudgementNode`

The BinaryJudgementNode determines whether the verdict is True or False based on the given criteria.

from typing import Optional, List
from deepeval.metrics.dag import BaseNode, VerdictNode
from deepeval.test_case import SingleTurnParams

class BinaryJudgementNode(BaseNode):
    criteria: str
    children: List[VerdictNode]
    evaluation_params: Optional[List[SingleTurnParams]] = None
    label: Optional[str] = None

There are TWO mandatory and TWO optional parameters when creating a BinaryJudgementNode:

criteria: a yes/no question based on output from parent node(s) and optionally parameters from the LLMTestCase. You DON'T HAVE TO TELL IT to output True or False.
children: a list of exactly two VerdictNodes, one with a verdict value of True, and the other with a value of False.
[Optional] evaluation_params: a list of type SingleTurnParams. Include only the parameters that are relevant for evaluation.
[Optional] label: a string that will be displayed in the verbose logs if verbose_mode is True.
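
The constraint on children can be expressed as a small check. This is a hypothetical helper written for illustration, not part of deepeval, and it operates on bare verdict values rather than on real VerdictNode objects.

```python
from typing import List, Union


def check_binary_verdicts(verdicts: List[Union[str, bool]]) -> None:
    """Raise ValueError unless the verdicts are exactly one True and one False."""
    if len(verdicts) != 2 or not all(isinstance(v, bool) for v in verdicts):
        raise ValueError("expected exactly two boolean verdicts")
    if set(verdicts) != {True, False}:
        raise ValueError("expected one True verdict and one False verdict")


check_binary_verdicts([False, True])  # valid: returns without raising
```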

`NonBinaryJudgementNode`

The NonBinaryJudgementNode determines what the verdict is based on the given criteria.

from typing import Optional, List
from deepeval.metrics.dag import BaseNode, VerdictNode
from deepeval.test_case import SingleTurnParams

class NonBinaryJudgementNode(BaseNode):
    criteria: str
    children: List[VerdictNode]
    evaluation_params: Optional[List[SingleTurnParams]] = None
    label: Optional[str] = None

There are TWO mandatory and TWO optional parameters when creating a NonBinaryJudgementNode:

criteria: an open-ended question based on output from parent node(s) and optionally parameters from the LLMTestCase. You DON'T HAVE TO TELL IT what to output.
children: a list of VerdictNodes, where the verdict values determine the possible verdict of the current NonBinaryJudgementNode.
[Optional] evaluation_params: a list of type SingleTurnParams. Include only the parameters that are relevant for evaluation.
[Optional] label: a string that will be displayed in the verbose logs if verbose_mode is True.
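
Because the children's verdict values define the possible answers, the routing step can be pictured as a lookup. The sketch below uses a plain dictionary as a hypothetical stand-in for the list of VerdictNodes; the names verdict_options and route are illustrative and do not exist in deepeval.

```python
from typing import Dict, List

# Hypothetical stand-in for VerdictNode children: verdict string -> score.
children: Dict[str, int] = {
    "Yes": 10,
    "Two are out of order": 4,
    "All out of order": 2,
}


def verdict_options(children: Dict[str, int]) -> List[str]:
    """The judge may only answer with one of the children's verdict values."""
    return list(children)


def route(answer: str, children: Dict[str, int]) -> int:
    """Follow the child whose verdict matches the judge's answer."""
    if answer not in children:
        raise ValueError(f"{answer!r} is not one of {verdict_options(children)}")
    return children[answer]


print(route("Two are out of order", children))  # 4
```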

`VerdictNode`

A VerdictNode must not be the root node of your DAG, and every leaf node is a VerdictNode. It contains no additional logic: it either returns the score assigned to its verdict or, when a child is given instead of a score, hands evaluation over to that child.

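from typing import Optional, Union
from deepeval.metrics.dag import BaseNode
from deepeval.metrics import GEval

class VerdictNode(BaseNode):
    verdict: Union[str, bool]
    score: Optional[int] = None  # optional, per the parameter list below
    child: Optional[Union[GEval, BaseNode]] = None  # optional, per the parameter list below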

There is ONE mandatory and TWO optional parameters when creating a VerdictNode:

verdict: a string OR boolean representing the possible outcomes of the previous parent node. It must be a string if the parent is a NonBinaryJudgementNode, else boolean if the parent is a BinaryJudgementNode.
[Optional] score: an integer between 0 and 10 that determines the final score of your DAGMetric based on the specified verdict value. You must provide a score if child is None.
[Optional] child: a BaseNode OR any BaseMetric, including GEval metric instances. If the score is not provided, the DAGMetric will use this provided child to run the provided BaseMetric instance to calculate a score, OR propagate the DAG execution to the BaseNode child.
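
The rules for score and child can be summarized with a small dataclass. VerdictSketch is a hypothetical stand-in written for this illustration, not the deepeval class, and it encodes only what the parameter descriptions state: a score must be given when there is no child, and a score must fall between 0 and 10.

```python
from dataclasses import dataclass
from typing import Any, Optional, Union


@dataclass
class VerdictSketch:
    """Hypothetical stand-in for a VerdictNode; not the deepeval class."""

    verdict: Union[str, bool]
    score: Optional[int] = None
    child: Optional[Any] = None  # a node or metric in the real library

    def __post_init__(self) -> None:
        if self.score is None and self.child is None:
            raise ValueError("provide a score when no child is given")
        if self.score is not None and not 0 <= self.score <= 10:
            raise ValueError("score must be between 0 and 10")


leaf = VerdictSketch(verdict="Yes", score=10)
# The string below is only a placeholder for a real child node.
handoff = VerdictSketch(verdict=True, child="correct_order_node")
```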

How Is It Calculated?

The DAGMetric score is determined by traversing the custom decision tree in topological order, using any evaluation models along the way to perform judgements to determine which path to take.
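
Topological order simply means that a node is visited only after every node it depends on. The sketch below uses Python's standard graphlib module on the node names from the walkthrough; it shows the ordering constraint only and is not deepeval's traversal code. The VerdictNodes between correct_headings_node and correct_order_node are left out for brevity, and a real run follows only the path selected by the verdicts.

```python
from graphlib import CycleError, TopologicalSorter

# Each node maps to the set of nodes it depends on (its parents).
parents = {
    "extract_headings_node": set(),
    "correct_headings_node": {"extract_headings_node"},
    "correct_order_node": {"extract_headings_node", "correct_headings_node"},
}

order = list(TopologicalSorter(parents).static_order())
print(order)
# ['extract_headings_node', 'correct_headings_node', 'correct_order_node']

# A cycle makes a topological order impossible, which is why the graph must be acyclic.
cyclic = {"a": {"b"}, "b": {"a"}}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)  # True
```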
