Tool Correctness
The tool correctness metric is an agentic LLM metric that assesses your LLM agent's function/tool calling ability. It is calculated by checking whether every tool that was expected to be used was actually called, and whether the agent's selection of tools was optimal.
The ToolCorrectnessMetric allows you to define the strictness of correctness. By default, it considers matching tool names to be correct, but you can also require input parameters and output to match.
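To make the strictness levels concrete, here is a minimal plain-Python sketch. It is not deepeval's implementation: the `Call` dataclass and the `calls_match` helper are hypothetical stand-ins for `ToolCall` and the metric's internal comparison, and the two boolean flags simply assume the behavior described above (names only by default, optionally input parameters and output as well).

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Call:
    """Hypothetical stand-in for a tool call record (not deepeval's ToolCall)."""
    name: str
    input_parameters: dict = field(default_factory=dict)
    output: Optional[Any] = None


def calls_match(called: Call, expected: Call,
                check_inputs: bool = False, check_output: bool = False) -> bool:
    """Compare two calls at the requested strictness level."""
    if called.name != expected.name:
        return False
    if check_inputs and called.input_parameters != expected.input_parameters:
        return False
    if check_output and called.output != expected.output:
        return False
    return True


called = Call("WebSearch", {"query": "return policy"}, output="30-day refund")
expected = Call("WebSearch", {"query": "shoe return policy"}, output="30-day refund")

print(calls_match(called, expected))                     # True: names match
print(calls_match(called, expected, check_inputs=True))  # False: queries differ
print(calls_match(called, expected, check_output=True))  # True: outputs match
```

The same pair of calls can pass or fail depending on how strict the comparison is, which is the trade-off the strictness setting controls.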
Required Arguments
To use the ToolCorrectnessMetric, you'll have to provide the following arguments when creating an LLMTestCase:
- input
- actual_output
- tools_called
- expected_tools
Read the How Is It Calculated section below to learn how test case parameters are used for metric calculation.
Usage
The ToolCorrectnessMetric() can be used for end-to-end evaluation:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import ToolCorrectnessMetric
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    # Replace this with the tools that were actually used by your LLM agent
    tools_called=[ToolCall(name="WebSearch"), ToolCall(name="ToolQuery")],
    expected_tools=[ToolCall(name="WebSearch")],
)
metric = ToolCorrectnessMetric()
# To run metric as a standalone
# metric.measure(test_case)
# print(metric.score, metric.reason)
evaluate(test_cases=[test_case], metrics=[metric])
There are EIGHT optional parameters when creating a ToolCorrectnessMetric:
- [Optional] available_tools: a list of ToolCalls that give context on all the tools that were available to your LLM agent. This list is used to evaluate your agent's tool selection capability.
- [Optional] threshold: a float representing the minimum passing threshold, defaulted to 0.5.
- [Optional] evaluation_params: a list of ToolCallParams indicating the strictness of the correctness criteria. Available options are ToolCallParams.INPUT_PARAMETERS and ToolCallParams.OUTPUT. For example, supplying a list containing ToolCallParams.INPUT_PARAMETERS but excluding ToolCallParams.OUTPUT will deem a tool correct if the tool name and input parameters match, even if the output does not. Defaults to an empty list.
- [Optional] include_reason: a boolean which when set to True, will include a reason for its evaluation score. Defaulted to True.
- [Optional] strict_mode: a boolean which when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.
- [Optional] verbose_mode: a boolean which when set to True, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to False.
- [Optional] should_consider_ordering: a boolean which when set to True, will consider the order in which the tools were called. For example, if expected_tools=[ToolCall(name="WebSearch"), ToolCall(name="ToolQuery"), ToolCall(name="WebSearch")] and tools_called=[ToolCall(name="WebSearch"), ToolCall(name="WebSearch"), ToolCall(name="ToolQuery")], the metric will consider the tool calling to be correct. Only available for ToolCallParams.TOOL and defaulted to False.
- [Optional] should_exact_match: a boolean which when set to True, will require the tools_called and expected_tools to be exactly the same. Available for ToolCallParams.TOOL and ToolCallParams.INPUT_PARAMETERS and defaulted to False.
Since should_exact_match is a stricter criterion than should_consider_ordering, setting should_consider_ordering will have no effect when should_exact_match is set to True.
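The relationship between the two flags can be illustrated with a small plain-Python sketch over tool names. This is not how deepeval implements either option: `is_exact_match` and `is_in_order` are hypothetical helpers, and reading "ordering" as a subsequence check (extra calls allowed) is an assumption made only for illustration.

```python
from typing import Sequence


def is_exact_match(called: Sequence[str], expected: Sequence[str]) -> bool:
    """True only when both sequences hold the same names in the same positions."""
    return list(called) == list(expected)


def is_in_order(called: Sequence[str], expected: Sequence[str]) -> bool:
    """True when the expected names occur in `called` in the same relative order.

    Assumption: ordering is treated as a subsequence check, so extra calls are allowed.
    """
    remaining = iter(called)
    # `name in remaining` consumes the iterator up to the first match,
    # so each expected name must be found after the previous one.
    return all(name in remaining for name in expected)


expected = ["WebSearch", "ToolQuery"]
candidates = {
    "identical": ["WebSearch", "ToolQuery"],
    "extra call, same order": ["WebSearch", "WebSearch", "ToolQuery"],
    "swapped": ["ToolQuery", "WebSearch"],
}

for label, called in candidates.items():
    print(label, is_exact_match(called, expected), is_in_order(called, expected))
```

Under this reading, any sequence that passes the exact check also passes the ordering check, which is why the ordering flag adds nothing once exact matching is enabled.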
Within components
You can also run the ToolCorrectnessMetric within nested components for component-level evaluation.
from deepeval.dataset import Golden
from deepeval.tracing import observe, update_current_span
...
@observe(metrics=[metric])
def inner_component():
    # Set test case at runtime
    test_case = LLMTestCase(input="...", actual_output="...")
    update_current_span(test_case=test_case)
    return

@observe
def llm_app(input: str):
    # Component can be anything from an LLM call, retrieval, agent, tool use, etc.
    inner_component()
    return
evaluate(observed_callback=llm_app, goldens=[Golden(input="Hi!")])
As a standalone
You can also run the ToolCorrectnessMetric on a single test case as a standalone, one-off execution.
...
metric.measure(test_case)
print(metric.score, metric.reason)
This is great for debugging or if you wish to build your own evaluation pipeline, but you will NOT get the benefits (testing reports, Confident AI platform) and all the optimizations (speed, caching, computation) the evaluate() function or deepeval test run offers.
How Is It Calculated?
The ToolCorrectnessMetric, unlike all other deepeval metrics, uses both deterministic and non-deterministic evaluation to give a final score. It uses tools_called, expected_tools and available_tools to find the final score.
The tool correctness metric score is calculated using the following steps:
- Find the deterministic score for tools_called against the expected_tools using the following equation:

$$\text{Tool Correctness} = \frac{\text{Number of Correctly Used Tools (or Correct Input Parameters/Outputs)}}{\text{Total Number of Tools Called}}$$

- This metric assesses the accuracy of your agent's tool usage by comparing the tools_called by your LLM agent to the list of expected_tools. A score of 1 indicates that every tool used by your LLM agent was called correctly according to the list of expected_tools, should_consider_ordering, and should_exact_match, while a score of 0 signifies that none of the tools_called were called correctly.
If should_exact_match is not set and ToolCallParams.INPUT_PARAMETERS is included in evaluation_params, correctness may be a percentage score based on the proportion of correct input parameters (assuming the name and output are correct, if applicable).
- If the available_tools are provided, the ToolCorrectnessMetric also uses an LLM to find whether the tools_called were the most optimal for the given task, using the available_tools as reference. The final score is the minimum of both scores. If available_tools is not provided, the LLM-based evaluation does not take place.
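The scoring steps described in this section can be sketched in plain Python as follows. This is an illustration under stated assumptions, not deepeval's code: all three function names are hypothetical, the per-parameter partial credit is one possible reading of "proportion of correct input parameters", returning 0.0 when no tools were called is an assumption, and the LLM-based selection score is passed in as a plain number rather than produced by a model.

```python
from typing import Optional, Sequence


def parameter_match_ratio(called: dict, expected: dict) -> float:
    """Share of expected input parameters reproduced exactly (hypothetical helper)."""
    if not expected:
        return 1.0
    matched = sum(1 for key, value in expected.items()
                  if key in called and called[key] == value)
    return matched / len(expected)


def deterministic_score(per_call_correctness: Sequence[float]) -> float:
    """Average correctness over all called tools; 0.0 for no calls (assumption)."""
    if not per_call_correctness:
        return 0.0
    return sum(per_call_correctness) / len(per_call_correctness)


def final_score(deterministic: float, llm_selection: Optional[float] = None,
                strict_mode: bool = False) -> float:
    """Take the minimum of both scores when an LLM-based score is available."""
    # llm_selection stays None when no available_tools were supplied.
    score = deterministic if llm_selection is None else min(deterministic, llm_selection)
    if strict_mode:
        # Binary outcome: 1 for a perfect score, 0 otherwise.
        return 1.0 if score == 1.0 else 0.0
    return score


# Two calls: the first fully correct, the second with one of two parameters matching.
first_call = 1.0
second_call = parameter_match_ratio(
    called={"query": "shoes", "limit": 5},
    expected={"query": "shoes", "limit": 10},
)
det = deterministic_score([first_call, second_call])

print(det)                                  # 0.75
print(final_score(det))                     # 0.75, no LLM-based score
print(final_score(det, llm_selection=0.5))  # 0.5, the lower score wins
print(final_score(det, strict_mode=True))   # 0.0, not a perfect score
```

Taking the minimum means a weak result on either side (tool usage or tool selection) caps the final score, so an agent cannot compensate for poor selection with otherwise correct calls.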