Tool Use
LLM-as-a-judge · Multi-turn · Referenceless · Agent · Multimodal
The Tool Use metric is a multi-turn agentic metric that evaluates your LLM agent's tool selection and argument generation capabilities. It is a self-explaining eval, which means it outputs a reason for its metric score.
Required Arguments
To use the ToolUseMetric, you'll have to provide the following arguments when creating a ConversationalTestCase:
turns
You can learn more about how it is calculated in the How Is It Calculated section below.
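For example, the turns for a weather-assistant agent might look like this (the tool name, arguments, and contents below are purely illustrative):

from deepeval.test_case import Turn, ConversationalTestCase, ToolCall

convo_test_case = ConversationalTestCase(
    turns=[
        Turn(role="user", content="What's the weather in Paris today?"),
        Turn(
            role="assistant",
            content="It's 18°C and partly cloudy in Paris.",
            # tools_called records the tool invocations made in this turn
            tools_called=[
                ToolCall(
                    name="get_weather",  # illustrative tool name
                    input_parameters={"city": "Paris"},  # illustrative arguments
                )
            ],
        ),
    ],
)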
Usage
The ToolUseMetric() can be used for end-to-end multi-turn evaluations of agents.
from deepeval import evaluate
from deepeval.metrics import ToolUseMetric
from deepeval.test_case import Turn, ConversationalTestCase, ToolCall
convo_test_case = ConversationalTestCase(
turns=[
Turn(role="...", content="..."),
Turn(role="...", content="...", tools_called=[...])
],
)
metric = ToolUseMetric(available_tools=[...], threshold=0.5)
# To run metric as a standalone
# metric.measure(convo_test_case)
# print(metric.score, metric.reason)
evaluate(test_cases=[convo_test_case], metrics=[metric])

There are ONE mandatory and SIX optional parameters when creating a ToolUseMetric:
available_tools: a list of ToolCalls that give context on all the tools that were available to your LLM agent. This list is used to evaluate your agent's tool selection capability.
[Optional] threshold: a float representing the minimum passing threshold, defaulted to 0.5.
[Optional] model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to gpt-5.4.
[Optional] include_reason: a boolean which when set to True, will include a reason for its evaluation score. Defaulted to True.
[Optional] strict_mode: a boolean which when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.
[Optional] async_mode: a boolean which when set to True, enables concurrent execution within the measure() method. Defaulted to True.
[Optional] verbose_mode: a boolean which when set to True, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to False.
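Putting these together, a fully configured instance might look like the following (the tool names and parameter values are illustrative, not recommendations):

from deepeval.metrics import ToolUseMetric
from deepeval.test_case import ToolCall

metric = ToolUseMetric(
    # All tools the agent could have chosen from (illustrative names)
    available_tools=[
        ToolCall(name="get_weather"),
        ToolCall(name="search_web"),
    ],
    threshold=0.7,
    include_reason=True,
    strict_mode=False,
    async_mode=True,
    verbose_mode=True,
)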
As a standalone
You can also run the ToolUseMetric on a single test case as a standalone, one-off execution.
...
metric.measure(convo_test_case)
print(metric.score, metric.reason)

How Is It Calculated
The ToolUseMetric score is determined through the following process:
- Compute the Tool Selection Score for each unit interaction.
- Compute the Argument Correctness Score for all unit interactions that include tool calls.
- The Tool Selection Score evaluates whether the agent chose the most appropriate tool for the task among all the available tools.
- The Argument Correctness Score assesses whether the arguments provided in the tool call were accurate and suitable for the task. This score is only considered when a tool call has been made.
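The exact aggregation of the two sub-scores into the final metric score isn't spelled out above, but structurally every unit interaction contributes a Tool Selection Score and, when a tool was called, an Argument Correctness Score. The sketch below is purely illustrative of that structure, not DeepEval's actual implementation; the judge_* helpers are hypothetical stand-ins for LLM-as-a-judge calls, and averaging all sub-scores is an assumption:

from dataclasses import dataclass, field

@dataclass
class UnitInteraction:
    tools_called: list = field(default_factory=list)

def judge_tool_selection(interaction: UnitInteraction) -> float:
    """Hypothetical LLM-judge call: did the agent pick the most
    appropriate tool (or correctly pick none) among available_tools?"""
    raise NotImplementedError

def judge_argument_correctness(interaction: UnitInteraction) -> float:
    """Hypothetical LLM-judge call: were the tool-call arguments
    accurate and suitable for the task?"""
    raise NotImplementedError

def tool_use_score(interactions: list) -> float:
    scores = []
    for interaction in interactions:
        # Tool Selection Score: computed for every unit interaction.
        scores.append(judge_tool_selection(interaction))
        if interaction.tools_called:
            # Argument Correctness Score: only when a tool call was made.
            scores.append(judge_argument_correctness(interaction))
    # Assumed aggregation: average all sub-scores equally.
    return sum(scores) / len(scores) if scores else 1.0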