DAG (Deep Acyclic Graph)
The deep acyclic graph (DAG) metric in deepeval is currently the most versatile custom metric: it lets you build deterministic decision trees for evaluation using LLM-as-a-judge.
The DAGMetric gives you more deterministic control than GEval. You can, however, still use GEval, or any other default metric in deepeval, within your DAGMetric.

Should I use DAG or G-Eval?
Suppose you want to check that a summary contains the headings "intro", "body", and "conclusion" in the right order. If you were to do this using GEval, your evaluation_steps might look something like this:
- The summary is completely wrong if it misses any of the headings: "intro", "body", "conclusion".
- If the summary has all the headings but they are in the wrong order, penalize it.
- If the summary has all the correct headings and they are in the right order, give it a perfect score.
Which in turn looks something like this in code:
from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import GEval

metric = GEval(
    name="Format Correctness",
    evaluation_steps=[
        "The `actual_output` is completely wrong if it misses any of the headings: 'intro', 'body', 'conclusion'.",
        "If the `actual_output` has all the headings but they are in the wrong order, penalize it.",
        "If the `actual_output` has all the correct headings and they are in the right order, give it a perfect score.",
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)
However, this will NOT give you the exact score according to your criteria, and is NOT as deterministic as you might think. Instead, you can build a DAGMetric that gives deterministic scores based on the logic you've decided for your evaluation criteria.
You can still use GEval in the DAGMetric, but the DAGMetric will give you much greater control.
Required Arguments
To use the DAGMetric, you'll have to provide the following arguments when creating an LLMTestCase:
- input
- actual_output
You'll also need to supply any additional arguments such as expected_output and tools_called if your evaluation criteria depend on these parameters.
Usage
The DAGMetric can be used to evaluate single-turn LLM interactions based on LLM-as-a-judge decision-trees.
from deepeval.metrics.dag import DeepAcyclicGraph
from deepeval.metrics import DAGMetric
dag = DeepAcyclicGraph(root_nodes=[...])
metric = DAGMetric(name="Instruction Following", dag=dag)
There are TWO mandatory and SIX optional parameters required when creating a DAGMetric:
- name: name of the metric.
- dag: a DeepAcyclicGraph which represents your evaluation decision tree. Here's how to create one.
- [Optional] threshold: a float representing the minimum passing threshold. Defaulted to 0.5.
- [Optional] model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to 'gpt-4.1'.
- [Optional] include_reason: a boolean which when set to True, will include a reason for its evaluation score. Defaulted to True.
- [Optional] strict_mode: a boolean which when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.
- [Optional] async_mode: a boolean which when set to True, enables concurrent execution within the measure() method. Defaulted to True.
- [Optional] verbose_mode: a boolean which when set to True, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to False.
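To make the interaction between threshold and strict_mode concrete, here is a minimal plain-Python sketch of the behaviour described in the list above. It is not deepeval's implementation: the function name finalize_score is hypothetical, and the sketch assumes the metric score is already expressed on a 0 to 1 scale.

```python
def finalize_score(raw_score: float, threshold: float = 0.5, strict_mode: bool = False):
    """Return (score, passed) following the documented threshold/strict_mode rules.

    Hypothetical helper for illustration only; assumes raw_score is in [0, 1].
    """
    if strict_mode:
        # strict_mode overrides the threshold and forces a binary score:
        # 1 for a perfect score, 0 otherwise.
        threshold = 1.0
        raw_score = 1.0 if raw_score >= 1.0 else 0.0
    return raw_score, raw_score >= threshold


print(finalize_score(0.7))                    # (0.7, True)
print(finalize_score(0.7, strict_mode=True))  # (0.0, False)
print(finalize_score(1.0, strict_mode=True))  # (1.0, True)
```

Under these assumptions, a score of 0.7 passes with the default threshold but fails as soon as strict_mode is enabled, since only a perfect score survives.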
Complete Walkthrough
In this walkthrough, we'll write a custom DAGMetric to see whether our LLM application has summarized meeting transcripts in the correct format. Let's say these are our criteria, in plain English:
- The summary of meeting transcripts should contain the "intro", "body", and "conclusion" headings.
- The summary of meeting transcripts should present the "intro", "body", and "conclusion" headings in the correct order.
Here's the example LLMTestCase representing the transcript to be evaluated for formatting correctness:
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="""
Alice: "Today's agenda: product update, blockers, and marketing timeline. Bob, updates?"
Bob: "Core features are done, but we're optimizing performance for large datasets. Fixes by Friday, testing next week."
Alice: "Charlie, does this timeline work for marketing?"
Charlie: "We need finalized messaging by Monday."
Alice: "Bob, can we provide a stable version by then?"
Bob: "Yes, we'll share an early build."
Charlie: "Great, we'll start preparing assets."
Alice: "Plan: fixes by Friday, marketing prep Monday, sync next Wednesday. Thanks, everyone!"
""",
    actual_output="""
Intro:
Alice outlined the agenda: product updates, blockers, and marketing alignment.
Body:
Bob reported performance issues being optimized, with fixes expected by Friday. Charlie requested finalized messaging by Monday for marketing preparation. Bob confirmed an early stable build would be ready.
Conclusion:
The team aligned on next steps: engineering finalizing fixes, marketing preparing content, and a follow-up sync scheduled for Wednesday.
""",
)
Build Your Decision Tree
The DAGMetric requires you to first construct a decision tree that has directed edges and is acyclic. Let's take this decision tree for example:
We can see that the actual_output of an LLMTestCase is first processed to extract all headings, before deciding whether all the required headings are present. If they are not, we give it a score of 0, heavily penalizing it, whereas if they are, we check the degree to which they are in the correct order. Based on this "degree of correct ordering", we can then decide what score to assign it.
The LLMTestCase shown in the diagram symbolizes that every node can access the LLMTestCase at any point in the DAG, although in this example only the first node, which extracts all the headings from the actual_output, needs it.
We can see that our decision tree involves four types of nodes:
- TaskNodes: this node simply processes an LLMTestCase into the desired format for subsequent judgement.
- BinaryJudgementNodes: this node takes in a criteria, and outputs a verdict of True/False based on whether that criteria has been met.
- NonBinaryJudgementNodes: this node also takes in a criteria, but unlike the BinaryJudgementNode, the NonBinaryJudgementNode can output a verdict other than True/False.
- VerdictNodes: the VerdictNode is always a leaf node, and determines the final output score based on the evaluation path that was taken.
Putting everything into context, the TaskNode is the node that extracts summary headings from the actual_output, the BinaryJudgementNode is the node that determines if all headings are present, while the NonBinaryJudgementNode determines if they are in the correct order. The final score is determined by the four VerdictNodes.
Some might be skeptical about whether this complexity is necessary, but in practice you'll quickly find that the more processing you do, the more deterministic your evaluation gets. You can of course combine the correctness and ordering of the summary headings in one step, but as your criteria get more complicated, your evaluation model is likely to hallucinate more and more.
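To see why splitting the evaluation into small steps pins the score down, the decision tree described above can be sketched in plain Python with each LLM judgement swapped for a deterministic string check. This only illustrates how the path taken determines the score; it does not use deepeval, and every name in it (extract_headings, order_verdict, score_summary) is hypothetical. The verdict labels and scores mirror the ones used in this walkthrough, and the heading extraction assumes each heading sits on its own line ending in a colon.

```python
EXPECTED = ["intro", "body", "conclusion"]
VERDICT_SCORES = {"Yes": 10, "Two are out of order": 4, "All out of order": 2}


def extract_headings(actual_output: str) -> list[str]:
    """Stand-in for the task step: collect lines that look like 'Heading:'."""
    headings = []
    for line in actual_output.splitlines():
        text = line.strip()
        if len(text) > 1 and text.endswith(":"):
            headings.append(text[:-1].strip().lower())
    return headings


def has_all_headings(headings: list[str]) -> bool:
    """Stand-in for the binary judgement: are all three headings present?"""
    return all(h in headings for h in EXPECTED)


def order_verdict(headings: list[str]) -> str:
    """Stand-in for the non-binary judgement on heading order."""
    # Keep the first occurrence of each expected heading, in document order.
    ordered = [h for h in dict.fromkeys(headings) if h in EXPECTED]
    # A permutation of three items has 0, 2, or 3 misplaced positions.
    misplaced = sum(a != b for a, b in zip(ordered, EXPECTED))
    if misplaced == 0:
        return "Yes"
    if misplaced == 2:
        return "Two are out of order"
    return "All out of order"


def score_summary(actual_output: str) -> int:
    headings = extract_headings(actual_output)
    if not has_all_headings(headings):
        return 0  # missing headings: heavily penalized
    return VERDICT_SCORES[order_verdict(headings)]


print(score_summary("Intro:\n...\nBody:\n...\nConclusion:\n..."))  # 10
print(score_summary("Body:\n...\nIntro:\n...\nConclusion:\n..."))  # 4
print(score_summary("Intro:\n...\nBody:\n..."))                    # 0
```

In the real metric the three stand-in functions are replaced by LLM calls, but the mapping from verdict path to score stays fixed in the same way.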
Implement DAG In Code
Here's what this decision tree looks like in code:
from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics.dag import (
    DeepAcyclicGraph,
    TaskNode,
    BinaryJudgementNode,
    NonBinaryJudgementNode,
    VerdictNode,
)

correct_order_node = NonBinaryJudgementNode(
    criteria="Are the summary headings in the correct order: 'intro' => 'body' => 'conclusion'?",
    children=[
        VerdictNode(verdict="Yes", score=10),
        VerdictNode(verdict="Two are out of order", score=4),
        VerdictNode(verdict="All out of order", score=2),
    ],
)

correct_headings_node = BinaryJudgementNode(
    criteria="Do the summary headings contain all three: 'intro', 'body', and 'conclusion'?",
    children=[
        VerdictNode(verdict=False, score=0),
        VerdictNode(verdict=True, child=correct_order_node),
    ],
)

extract_headings_node = TaskNode(
    instructions="Extract all headings in `actual_output`",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    output_label="Summary headings",
    children=[correct_headings_node, correct_order_node],
)

# create the DAG
dag = DeepAcyclicGraph(root_nodes=[extract_headings_node])
When creating your DAG, there are three important points to remember:
- There should only be an edge to a parent node if the current node depends on the output of the parent node.
- All nodes, except for VerdictNodes, can have access to an LLMTestCase at any point in time.
- All leaf nodes are VerdictNodes, but not all VerdictNodes are leaf nodes.
IMPORTANT: You'll see that in our example, extract_headings_node has correct_order_node as a child because correct_order_node's criteria depends on the extracted summary headings from the actual_output of the LLMTestCase.
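The structural rules above (no cycles, and every leaf being a verdict node) can be checked mechanically. The following is a minimal sketch on a toy graph type rather than on deepeval's own node classes: the Node dataclass, its kind field, and the validate function are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Toy stand-in for a DAG node; not a deepeval class.
    name: str
    kind: str  # "task", "binary", "non_binary", or "verdict"
    children: list["Node"] = field(default_factory=list)


def validate(roots: list[Node]) -> list[str]:
    """Return a list of problems: cycles, or leaves that are not verdict nodes."""
    errors: list[str] = []
    visiting: set[int] = set()  # nodes on the current depth-first path
    done: set[int] = set()      # nodes already fully checked

    def visit(node: Node) -> None:
        key = id(node)
        if key in visiting:
            errors.append(f"cycle through {node.name}")
            return
        if key in done:
            return  # shared child reached via another parent
        visiting.add(key)
        if not node.children and node.kind != "verdict":
            errors.append(f"leaf {node.name} is not a verdict node")
        for child in node.children:
            visit(child)
        visiting.discard(key)
        done.add(key)

    for root in roots:
        visit(root)
    return errors


# Same shape as the walkthrough: the order node is a child of both
# the task node and the True verdict of the headings node.
order = Node("correct_order", "non_binary", [
    Node("Yes", "verdict"),
    Node("Two are out of order", "verdict"),
    Node("All out of order", "verdict"),
])
headings = Node("correct_headings", "binary", [
    Node("False", "verdict"),
    Node("True", "verdict", [order]),
])
extract = Node("extract_headings", "task", [headings, order])

print(validate([extract]))  # []
```

Note that the shared order node is visited only once, which is what makes the structure a graph rather than a plain tree.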
To make creating a DAGMetric easier, you should aim to start by sketching out all the criteria and different paths your evaluation can take.