Agentic
Plan Adherence
LLM-as-a-judge
Single-turn
Referenceless
Agent
Multimodal
The Plan Adherence metric is an agentic metric that extracts the task and plan from your agent's trace, then uses them to evaluate how well your agent adhered to the plan while completing the task. It is a self-explaining eval, which means it outputs a reason for its metric score.
Usage
To begin, set up tracing and supply PlanAdherenceMetric() either to your agent's @observe decorator or to the evals_iterator method.
from somewhere import llm
from deepeval.tracing import observe, update_current_trace
from deepeval.dataset import Golden, EvaluationDataset
from deepeval.metrics import PlanAdherenceMetric
from deepeval.test_case import ToolCall

@observe
def tool_call(input):
    ...
    return [ToolCall(name="CheckWhether")]

@observe
def agent(input):
    tools = tool_call(input)
    output = llm(input, tools)
    # Attach the input, output, and tool calls to the current trace
    update_current_trace(
        input=input,
        output=output,
        tools_called=tools
    )
    return output

# Create dataset
dataset = EvaluationDataset(goldens=[Golden(input="What's the weather like in SF?")])

# Initialize metric
metric = PlanAdherenceMetric(threshold=0.7, model="gpt-4o")

# Loop through dataset
for golden in dataset.evals_iterator(metrics=[metric]):
    agent(golden.input)

There are SIX optional parameters when creating a PlanAdherenceMetric:
- [Optional] threshold: a float representing the minimum passing threshold, defaulted to 0.5.
- [Optional] model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to gpt-5.4.
- [Optional] include_reason: a boolean which when set to True, will include a reason for its evaluation score. Defaulted to True.
- [Optional] strict_mode: a boolean which when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.
- [Optional] async_mode: a boolean which when set to True, enables concurrent execution within the measure() method. Defaulted to True.
- [Optional] verbose_mode: a boolean which when set to True, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to False.
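The interaction between threshold and strict_mode can be sketched in plain Python. This is a minimal illustration of the behavior described above, not deepeval's implementation; the helper name resolve_score and its return shape are assumptions made for this example.

```python
def resolve_score(raw_score: float, threshold: float = 0.5, strict_mode: bool = False):
    """Return (final_score, passed) for a raw score in [0, 1].

    Hypothetical helper: it only mirrors the documented behavior.
    """
    if strict_mode:
        # Strict mode overrides the threshold to 1 and makes the score binary.
        threshold = 1.0
        final_score = 1.0 if raw_score >= 1.0 else 0.0
    else:
        final_score = raw_score
    return final_score, final_score >= threshold


print(resolve_score(0.8))                    # (0.8, True)
print(resolve_score(0.8, strict_mode=True))  # (0.0, False)
```

With the default threshold a raw score of 0.8 passes, while the same score under strict mode collapses to 0 and fails.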
To learn more about how evals_iterator works, see the evals_iterator documentation.
How Is It Calculated?
The PlanAdherenceMetric score is calculated by following these steps:
- Extract the Task from the trace. The task defines the user's goal or intent for the agent and is actionable.
- Extract the Plan from the trace. The plan is taken from the agent's thinking or reasoning. If no statements in the trace clearly define or imply a plan, the metric passes by default with a score of 1.
- Evaluate the agent's execution steps from the trace to see how accurately the agent adhered to the plan.
- Compute the Alignment Score: an LLM generates the final score from the pre-processed and extracted information (task, plan, and execution steps).
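The control flow of these steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: plan_adherence_score and toy_judge are hypothetical names, the task, plan, and execution steps are assumed to be already extracted, and toy_judge is a deterministic stand-in for the LLM judge so the example runs offline.

```python
from typing import Callable, Optional, Sequence


def plan_adherence_score(
    task: str,
    plan: Optional[Sequence[str]],
    execution_steps: Sequence[str],
    judge: Callable[[str, Sequence[str], Sequence[str]], float],
) -> float:
    """Hypothetical outline of the scoring flow described above."""
    if not plan:
        # No plan could be extracted: the metric passes by default.
        return 1.0
    # The judge (an LLM in the real metric) rates how well execution follows the plan.
    score = judge(task, plan, execution_steps)
    # Clamp to [0, 1] in case the judge returns an out-of-range value.
    return max(0.0, min(1.0, score))


def toy_judge(task: str, plan: Sequence[str], execution_steps: Sequence[str]) -> float:
    """Stand-in for the LLM judge: fraction of planned steps that were executed.

    The task is ignored here; a real judge would take it into account.
    """
    executed = set(execution_steps)
    return sum(step in executed for step in plan) / len(plan)


plan = ["look up location", "call weather tool", "summarize forecast"]
steps = ["look up location", "call weather tool"]

print(round(plan_adherence_score("weather in SF", plan, steps, toy_judge), 2))  # 0.67
print(plan_adherence_score("weather in SF", None, steps, toy_judge))            # 1.0
```

The second call shows why a perfect score can be misleading: with no plan in the trace, the judge is never consulted.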
FAQs
My agent made a good plan but didn't follow it — which metric catches that?
Plan Adherence. It computes an AlignmentScore between (Task, Plan) and the actual Execution Steps, so an ignored plan scores low. Plan Quality would still rate that plan highly, since it judges only the plan.
How is Plan Adherence different from Plan Quality?
Plan Adherence is process — did execution follow the plan. Plan Quality is the plan — was it good in the first place. Run both to tell a bad plan from a good plan poorly executed.
Plan Adherence vs Task Completion — following the plan vs getting it done?
Right. Plan Adherence checks if the agent stuck to its plan; Task Completion checks if it achieved the outcome. An agent can deviate yet finish, or follow the plan and fail — different failure modes.
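One way to read the three scores together is a small triage helper. The sketch below is illustrative only: the diagnose function, its labels, and the single shared cutoff of 0.5 are assumptions for this example, not part of deepeval.

```python
def diagnose(
    plan_quality: float,
    plan_adherence: float,
    task_completion: float,
    cutoff: float = 0.5,
) -> str:
    """Map three metric scores to a rough failure-mode label (hypothetical)."""
    good_plan = plan_quality >= cutoff
    followed = plan_adherence >= cutoff
    done = task_completion >= cutoff

    if not good_plan:
        return "weak plan"
    if not followed:
        # The plan was fine but execution drifted away from it.
        return "deviated but finished" if done else "good plan, poorly executed"
    return "on track" if done else "followed plan but failed"


print(diagnose(0.9, 0.2, 0.9))  # deviated but finished
print(diagnose(0.9, 0.9, 0.1))  # followed plan but failed
```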
What happens if my agent's trace has no plan?
The plan comes from the agent's thinking or reasoning. With none to extract, there is nothing to adhere to and the metric passes by default with a score of 1. An unexpected perfect score usually means your trace isn't surfacing reasoning.
Can I use Plan Adherence with LangChain, OpenAI, or another framework?
Yes. deepeval auto-traces agents built with LangChain, OpenAI, LlamaIndex, CrewAI, and more. See the framework integrations for the full list.