
Task Completion

LLM-as-a-judge · Referenceless metric · Agent metric

The task completion metric uses LLM-as-a-judge to evaluate how effectively an LLM agent accomplishes a task. Task Completion is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.

info

Task Completion analyzes your agent's full trace to determine task success, which requires setting up tracing.

Usage

To begin, set up tracing and supply the TaskCompletionMetric() to your agent's @observe decorator.

from deepeval.metrics import TaskCompletionMetric
from deepeval.tracing import observe
from deepeval.dataset import Golden
from deepeval import evaluate

task_completion = TaskCompletionMetric(
    threshold=0.7,
    model="gpt-4o",
    include_reason=True
)

@observe(metrics=[task_completion])
def trip_planner_agent(input):
    destination = "Paris"
    days = 2

    @observe()
    def restaurant_finder(city):
        return ["Le Jules Verne", "Angelina Paris", "Septime"]

    @observe()
    def itinerary_generator(destination, days):
        return ["Eiffel Tower", "Louvre Museum", "Montmartre"][:days]

    itinerary = itinerary_generator(destination, days)
    restaurants = restaurant_finder(destination)

    output = []
    for i in range(days):
        output.append(f"{itinerary[i]} and eat at {restaurants[i]}")

    return ". ".join(output) + "."

evaluate(observed_callback=trip_planner_agent, goldens=[Golden(input="Paris, 2")])

There are SEVEN optional parameters when creating a TaskCompletionMetric (see the combined sketch after this list):

  • [Optional] threshold: a float representing the minimum passing threshold, defaulted to 0.5.
  • [Optional] task: a string representing the task to be completed. If no task is supplied, it is automatically inferred from the trace. Defaulted to None.
  • [Optional] model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to 'gpt-4o'.
  • [Optional] include_reason: a boolean which when set to True, will include a reason for its evaluation score. Defaulted to True.
  • [Optional] strict_mode: a boolean which when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.
  • [Optional] async_mode: a boolean which when set to True, enables concurrent execution within the measure() method. Defaulted to True.
  • [Optional] verbose_mode: a boolean which when set to True, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to False.
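
For example, here is a sketch that sets all seven parameters explicitly (the task string is illustrative; omit it to let the metric infer the task from the trace):

task_completion = TaskCompletionMetric(
    threshold=0.7,
    task="Plan a 2-day Paris itinerary with restaurants",  # illustrative; inferred from the trace if omitted
    model="gpt-4o",
    include_reason=True,
    strict_mode=False,  # set True to force a binary 1/0 score and a threshold of 1
    async_mode=True,
    verbose_mode=True,  # prints intermediate steps to the console
)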

End-to-End

You can also run the TaskCompletionMetric as a standalone metric for end-to-end evaluation, though this is not the recommended way to use it, since thorough evaluation requires the full trace.

from deepeval import evaluate
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import TaskCompletionMetric

metric = TaskCompletionMetric(
    threshold=0.7,
    model="gpt-4.1",
    include_reason=True
)
test_case = LLMTestCase(
    input="Plan a 3-day itinerary for Paris with cultural landmarks and local cuisine.",
    actual_output=(
        "Day 1: Eiffel Tower, dinner at Le Jules Verne. "
        "Day 2: Louvre Museum, lunch at Angelina Paris. "
        "Day 3: Montmartre, evening at a wine bar."
    ),
    tools_called=[
        ToolCall(
            name="Itinerary Generator",
            description="Creates travel plans based on destination and duration.",
            input_parameters={"destination": "Paris", "days": 3},
            output=[
                "Day 1: Eiffel Tower, Le Jules Verne.",
                "Day 2: Louvre Museum, Angelina Paris.",
                "Day 3: Montmartre, wine bar.",
            ],
        ),
        ToolCall(
            name="Restaurant Finder",
            description="Finds top restaurants in a city.",
            input_parameters={"city": "Paris"},
            output=["Le Jules Verne", "Angelina Paris", "local wine bars"],
        ),
    ],
)

# To run metric as a standalone
# metric.measure(test_case)
# print(metric.score, metric.reason)

evaluate(test_cases=[test_case], metrics=[metric])
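
Uncommenting the standalone block above runs the metric on a single test case outside the evaluate() pipeline; metric.score and metric.reason are then available directly, which is convenient when debugging an individual case.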

To use the TaskCompletionMetric for end-to-end evaluation, you'll have to provide the following arguments when creating an LLMTestCase:

  • input
  • actual_output
  • tools_called

The input and actual_output are required to create an LLMTestCase (and hence required by all metrics) even though they might not be used for metric calculation. Read the How Is It Calculated section below to learn more.
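
To make the bare minimum concrete, a minimal test case for this metric might look like the following sketch (the values are placeholders, not taken from the example above):

from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="Plan a weekend in Rome.",                    # required by all metrics
    actual_output="Day 1: Colosseum. Day 2: Vatican.",  # required by all metrics
    tools_called=[
        ToolCall(name="Itinerary Generator", output=["Colosseum", "Vatican"]),
    ],
)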

caution

This approach is not recommended and will soon be deprecated, as a single test case does not represent the full execution of an agent.

How Is It Calculated?

The TaskCompletionMetric score is calculated according to the following equation:

\text{Task Completion Score} = \text{AlignmentScore}(\text{Task}, \text{Outcome})
  • Task and Outcome are extracted from the trace (or test case for end-to-end) using an LLM.
  • The Alignment Score measures how well the outcome aligns with the extracted (or user-provided) task, as judged by an LLM.
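
To inspect these intermediate steps yourself, you can enable verbose_mode and run the metric directly (a sketch reusing the test_case from the End-to-End section above):

metric = TaskCompletionMetric(verbose_mode=True)
metric.measure(test_case)           # prints intermediate steps, including the extracted task and outcome
print(metric.score, metric.reason)  # the alignment score and the judge's reasoning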