Goal Accuracy
LLM-as-a-judge
Multi-turn
Referenceless
Agent
Multimodal
The Goal Accuracy metric is a multi-turn agentic metric that evaluates how well your LLM agent plans and then executes that plan to finish a task or reach a goal. It is a self-explaining eval, which means it outputs a reason for its metric score.
Required Arguments
To use the GoalAccuracyMetric, you'll have to provide the following arguments when creating a ConversationalTestCase:
turns
You can learn more about how it is calculated in the How Is It Calculated section.
Usage
The GoalAccuracyMetric() can be used for end-to-end multi-turn evaluations of agents.
from deepeval import evaluate
from deepeval.metrics import GoalAccuracyMetric
from deepeval.test_case import Turn, ConversationalTestCase, ToolCall
convo_test_case = ConversationalTestCase(
    turns=[
        Turn(role="...", content="..."),
        Turn(role="...", content="...", tools_called=[...])
    ],
)
metric = GoalAccuracyMetric(threshold=0.5)
# To run metric as a standalone
# metric.measure(convo_test_case)
# print(metric.score, metric.reason)
evaluate(test_cases=[convo_test_case], metrics=[metric])

There are SIX optional parameters when creating a GoalAccuracyMetric:
- [Optional] threshold: a float representing the minimum passing threshold, defaulted to 0.5.
- [Optional] model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to gpt-5.4.
- [Optional] include_reason: a boolean which, when set to True, will include a reason for its evaluation score. Defaulted to True.
- [Optional] strict_mode: a boolean which, when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.
- [Optional] async_mode: a boolean which, when set to True, enables concurrent execution within the measure() method. Defaulted to True.
- [Optional] verbose_mode: a boolean which, when set to True, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to False.
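To make the interaction between threshold and strict_mode concrete, here is a minimal sketch in plain Python. It does not call deepeval; the function name `apply_threshold` is hypothetical, and the logic only mirrors the parameter descriptions above (a score passes when it meets the minimum threshold, and strict mode makes the score binary with a threshold of 1).

```python
from typing import Tuple


def apply_threshold(
    raw_score: float, threshold: float = 0.5, strict_mode: bool = False
) -> Tuple[float, bool]:
    """Illustrative only: return (reported_score, passed).

    Assumption: a test case passes when the score is at least the threshold.
    """
    if strict_mode:
        # Strict mode overrides whatever threshold was configured and sets it to 1.
        threshold = 1.0
        # The score becomes binary: 1 for perfection, 0 otherwise.
        raw_score = 1.0 if raw_score >= 1.0 else 0.0
    return raw_score, raw_score >= threshold


print(apply_threshold(0.7))                    # default threshold of 0.5
print(apply_threshold(0.7, strict_mode=True))  # same raw score under strict mode
```

Under these assumptions, a raw score of 0.7 passes with the default threshold but is reported as 0 and fails once strict_mode is enabled.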
As a standalone
You can also run the GoalAccuracyMetric on a single test case as a standalone, one-off execution.
...
metric.measure(convo_test_case)
print(metric.score, metric.reason)

How Is It Calculated
The GoalAccuracyMetric score is calculated using the following steps:
- Find the individual goals and the steps taken by your LLM agent for each user-assistant interaction.
- Find goal accuracy scores for each of the goal-step pairs using the evaluation model.
- Find plan quality and plan adherence scores for each of the goal-step pairs using the evaluation model.
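The first step, splitting a conversation into user-assistant interactions, can be pictured with the rough sketch below. It is not deepeval's implementation: `SimpleTurn` is a hypothetical stand-in for the library's Turn, and the grouping rule (a new interaction begins whenever a user turn follows a non-user turn) is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimpleTurn:
    """Hypothetical stand-in for a conversation turn (not deepeval's Turn)."""
    role: str
    content: str
    tools_called: List[str] = field(default_factory=list)


def split_interactions(turns: List[SimpleTurn]) -> List[List[SimpleTurn]]:
    """Group turns so each interaction starts at a user turn.

    Assumption: a user turn that follows a non-user turn opens a new interaction;
    every other turn is appended to the current interaction.
    """
    interactions: List[List[SimpleTurn]] = []
    for turn in turns:
        opens_new = turn.role == "user" and (
            not interactions or interactions[-1][-1].role != "user"
        )
        if opens_new or not interactions:
            interactions.append([turn])
        else:
            interactions[-1].append(turn)
    return interactions


conversation = [
    SimpleTurn("user", "Book me a flight to Paris."),
    SimpleTurn("assistant", "Searching flights.", tools_called=["search_flights"]),
    SimpleTurn("assistant", "Booked the 9am flight."),
    SimpleTurn("user", "Now find a hotel."),
    SimpleTurn("assistant", "Here are three hotels.", tools_called=["search_hotels"]),
]
groups = split_interactions(conversation)
print([len(group) for group in groups])
```

Each resulting group would then be handed to the evaluation model to extract a goal and the steps taken, which is the part this sketch deliberately leaves out.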
FAQs
My agent answered every message but still failed the user's actual goal — will this catch it?
Yes. GoalAccuracyMetric extracts the underlying task from the user's messages and judges whether the agent's plan and steps actually reached it. A conversation can look responsive turn by turn yet score low if the goal was never accomplished.
What's the difference between the goal score and the plan score?
The final score averages a goal evaluation score (did it reach the goal) and a plan evaluation score (plan quality and adherence). The average reflects both reaching the goal and how well it was planned.
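As a rough illustration of that averaging, the sketch below combines per-interaction scores into one number. The function name `combine_scores` is hypothetical, and averaging each score type across interactions before taking the final mean is an assumption; the text above only states that the final score averages the goal and plan evaluation scores.

```python
from statistics import mean
from typing import Sequence


def combine_scores(goal_scores: Sequence[float], plan_scores: Sequence[float]) -> float:
    """Illustrative only: average the goal and plan evaluation scores.

    Assumption: each list holds one score per user-assistant interaction,
    and each list is averaged before the two results are combined.
    """
    goal_evaluation = mean(goal_scores)  # did the agent reach the goal?
    plan_evaluation = mean(plan_scores)  # plan quality and adherence
    return (goal_evaluation + plan_evaluation) / 2


# Two interactions: goals mostly reached, planning only middling.
score = combine_scores(goal_scores=[1.0, 0.5], plan_scores=[0.5, 0.5])
print(score)
```

With these inputs the goal evaluation is 0.75 and the plan evaluation is 0.5, so an agent that reaches its goals with a weak plan still lands below the goal score alone.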
How does it know the goal if I never pass an expected outcome?
It infers the goal from the "user" messages, then evaluates the steps taken to satisfy it. You only supply turns on the ConversationalTestCase.
Does it account for tool calls when scoring the plan?
Yes. Turns can include tools_called, and the metric factors tool usage into the plan it reconstructs and into its judgment of how well that plan reached the goal.