Multi-Turn
Tool Use
LLM-as-a-judge
Multi-turn
Referenceless
Agent
Multimodal
The Tool Use metric is a multi-turn agentic metric that evaluates your LLM agent's tool selection and argument generation capabilities. It is a self-explaining eval, which means it outputs a reason for its metric score.
Required Arguments
To use the ToolUseMetric, you'll have to provide the following arguments when creating a ConversationalTestCase:
turns
You can learn more about how it is calculated in the How Is It Calculated section.
Usage
The ToolUseMetric() can be used for end-to-end multi-turn evaluations of agents.
from deepeval import evaluate
from deepeval.metrics import ToolUseMetric
from deepeval.test_case import Turn, ConversationalTestCase, ToolCall

convo_test_case = ConversationalTestCase(
    turns=[
        Turn(role="...", content="..."),
        Turn(role="...", content="...", tools_called=[...])
    ],
)
# available_tools is the one mandatory parameter: a list of ToolCalls
metric = ToolUseMetric(available_tools=[...], threshold=0.5)

# To run metric as a standalone
# metric.measure(convo_test_case)
# print(metric.score, metric.reason)

evaluate(test_cases=[convo_test_case], metrics=[metric])

There are ONE mandatory and SIX optional parameters when creating a ToolUseMetric:
- available_tools: a list of ToolCalls that give context on all the tools that were available to your LLM agent. This list is used to evaluate your agent's tool selection capability.
- [Optional] threshold: a float representing the minimum passing threshold, defaulted to 0.5.
- [Optional] model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to gpt-5.4.
- [Optional] include_reason: a boolean which when set to True, will include a reason for its evaluation score. Defaulted to True.
- [Optional] strict_mode: a boolean which when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.
- [Optional] async_mode: a boolean which when set to True, enables concurrent execution within the measure() method. Defaulted to True.
- [Optional] verbose_mode: a boolean which when set to True, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to False.
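The interaction between threshold and strict_mode can be illustrated with a small standalone sketch. It does not call deepeval: the function name finalize_score is hypothetical, and the pass rule (score at or above the threshold) is an assumption drawn from the phrase "minimum passing threshold" above.

```python
from typing import Tuple


def finalize_score(
    raw_score: float, threshold: float = 0.5, strict_mode: bool = False
) -> Tuple[float, bool]:
    """Return (reported_score, passed) for a raw score in [0, 1].

    Hypothetical helper for illustration only; not part of deepeval.
    """
    if not 0.0 <= raw_score <= 1.0:
        raise ValueError("raw_score must be between 0 and 1")
    if strict_mode:
        # Strict mode: binary score, and the threshold is overridden to 1.
        threshold = 1.0
        score = 1.0 if raw_score >= 1.0 else 0.0
    else:
        score = raw_score
    # Assumption: a test passes when the score meets or exceeds the threshold.
    return score, score >= threshold


print(finalize_score(0.8))                    # default threshold of 0.5
print(finalize_score(0.8, strict_mode=True))  # anything short of perfect becomes 0
```

Under these assumptions, a score of 0.8 passes with the default threshold but is reported as 0 and fails once strict_mode is enabled.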
As a standalone
You can also run the ToolUseMetric on a single test case as a standalone, one-off execution.
...
metric.measure(convo_test_case)
print(metric.score, metric.reason)

How Is It Calculated
The ToolUseMetric score is determined through the following process:
- Compute the Tool Selection Score for each unit interaction.
- Compute the Argument Correctness Score for all unit interactions that include tool calls.
- Take the final score as min(Tool Selection Score, Argument Correctness Score).
The two component scores are defined as follows:
- The Tool Selection Score evaluates whether the agent chose the most appropriate tool for the task among all the available tools.
- The Argument Correctness Score assesses whether the arguments provided in the tool call were accurate and suitable for the task. This score is only considered when a tool call has been made.
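As a rough illustration of how the two component scores combine, here is a minimal sketch using the min rule that this page's FAQ states. The function name tool_use_score is hypothetical and not part of deepeval. Falling back to the Tool Selection Score alone when no tool was called is an assumption, based on the note that the argument score is only considered when a tool call has been made.

```python
from typing import Optional


def tool_use_score(
    tool_selection: float, argument_correctness: Optional[float] = None
) -> float:
    """Combine the two component scores into one value.

    Hypothetical helper for illustration only; not part of deepeval.
    Pass argument_correctness=None when no tool call was made.
    """
    if argument_correctness is None:
        # Assumption: with no tool calls, only tool selection counts.
        return tool_selection
    # The weaker of the two scores determines the result.
    return min(tool_selection, argument_correctness)


# Right tool, wrong arguments: the low argument score caps the result.
print(tool_use_score(1.0, 0.2))
# No tool call was made, so only tool selection is considered.
print(tool_use_score(0.7))
```

Because the result is a minimum rather than an average, a strong score on one component cannot compensate for a weak score on the other.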
FAQs
How do I check whether my agent called the right tools across a conversation?
ToolUseMetric evaluates each interaction's tool selection and then the arguments passed, flagging both wrong-tool turns and right-tool-but-wrong-arguments turns.
Why do I have to pass available_tools if the agent already called tools?
available_tools is the full menu of ToolCalls the agent could have used. Without the alternatives, the metric can't tell whether a better tool existed or whether a tool should have been called at all.
My agent picked the right tool but with wrong arguments. Does that fail?
Yes. The final score is min(ToolSelectionScore, ArgumentCorrectnessScore), so a perfect tool choice with bad arguments still drags it down. The argument score only applies to turns where a tool was called.
Does it catch unnecessary or missing tool calls?
Both hurt the Tool Selection Score: calling an unneeded tool or skipping a needed one scores as poor selection against available_tools. Use verbose_mode for per-interaction reasoning.