Turn Faithfulness
The turn faithfulness metric is a conversational metric that determines whether your LLM chatbot generates factually accurate responses grounded in the retrieval context throughout a conversation.
Required Arguments
To use the TurnFaithfulnessMetric, you'll have to provide the following arguments when creating a ConversationalTestCase:
turns
You must provide the role, content, and retrieval_context for evaluation to happen. Read the How Is It Calculated section below to learn more.
Usage
The TurnFaithfulnessMetric() can be used for end-to-end multi-turn evaluation:
from deepeval import evaluate
from deepeval.test_case import Turn, ConversationalTestCase
from deepeval.metrics import TurnFaithfulnessMetric
convo_test_case = ConversationalTestCase(
    turns=[
        Turn(role="user", content="...", retrieval_context=["..."]),
        Turn(role="assistant", content="...", retrieval_context=["..."]),
    ]
)
metric = TurnFaithfulnessMetric(threshold=0.5)
# To run metric as a standalone
# metric.measure(convo_test_case)
# print(metric.score, metric.reason)
evaluate(test_cases=[convo_test_case], metrics=[metric])
There are NINE optional parameters when creating a TurnFaithfulnessMetric:
- [Optional] threshold: a float representing the minimum passing threshold, defaulted to 0.5.
- [Optional] model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to gpt-5.4.
- [Optional] include_reason: a boolean which, when set to True, will include a reason for its evaluation score. Defaulted to True.
- [Optional] strict_mode: a boolean which, when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.
- [Optional] async_mode: a boolean which, when set to True, enables concurrent execution within the measure() method. Defaulted to True.
- [Optional] verbose_mode: a boolean which, when set to True, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to False.
- [Optional] truths_extraction_limit: an optional integer to limit the number of truths extracted from retrieval context per document. Defaulted to None.
- [Optional] penalize_ambiguous_claims: a boolean which, when set to True, penalizes claims that cannot be verified as true or false. Defaulted to False.
- [Optional] window_size: an integer which defines the size of the sliding window of turns used during evaluation. Defaulted to 10.
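The interaction between strict_mode and threshold can be easier to see in code. The sketch below is a standalone illustration and not DeepEval's implementation: the helper name finalize_score is hypothetical, and the pass check (score >= threshold) is an assumption based on threshold being described as the minimum passing threshold.

```python
from typing import Tuple


def finalize_score(
    raw_score: float, threshold: float = 0.5, strict_mode: bool = False
) -> Tuple[float, float, bool]:
    """Hypothetical helper: apply strict_mode as described in the parameter list.

    Returns (final_score, effective_threshold, passed).
    """
    if strict_mode:
        # strict_mode overrides the threshold and sets it to 1 ...
        threshold = 1.0
        # ... and makes the score binary: 1 for perfection, 0 otherwise.
        raw_score = 1.0 if raw_score >= 1.0 else 0.0
    # Assumption: a test case passes when the score reaches the threshold.
    passed = raw_score >= threshold
    return raw_score, threshold, passed


print(finalize_score(0.9))                    # (0.9, 0.5, True)
print(finalize_score(0.9, strict_mode=True))  # (0.0, 1.0, False)
print(finalize_score(1.0, strict_mode=True))  # (1.0, 1.0, True)
```

Under these assumptions, a score of 0.9 passes with the default threshold but fails once strict_mode is enabled.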
As a standalone
You can also run the TurnFaithfulnessMetric on a single test case as a standalone, one-off execution.
...
metric.measure(convo_test_case)
print(metric.score, metric.reason)
How Is It Calculated?
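The TurnFaithfulnessMetric score is calculated according to the following equation:
$$\text{Turn Faithfulness} = \frac{\sum \text{Interaction Faithfulness Scores}}{\text{Number of Interactions}}$$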
The TurnFaithfulnessMetric first constructs sliding windows of turns. For each window, it:
- Extracts truths from the retrieval context provided in the turns
- Generates claims from the assistant's responses in the interaction
- Evaluates verdicts by checking if each claim contradicts the truths
- Calculates the interaction score as the ratio of faithful claims to total claims
The final score is the average of all interaction faithfulness scores across the conversation.
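The arithmetic in the last two steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DeepEval's code: the function names are hypothetical, the verdict labels "yes", "no", and "idk" are assumed stand-ins for faithful, contradicting, and ambiguous claims, and a window with no claims is assumed to score 1.0. The LLM-driven steps (extracting truths, generating claims, producing verdicts) are taken as already done.

```python
from typing import List


def interaction_score(verdicts: List[str], penalize_ambiguous: bool = False) -> float:
    """Ratio of faithful claims to total claims for one window.

    Each verdict is assumed to be "yes" (supported), "no" (contradicts the
    truths), or "idk" (cannot be verified either way).
    """
    if not verdicts:
        # Assumption: no claims means nothing can be unfaithful.
        return 1.0
    # Ambiguous claims only count against the score when penalized.
    unfaithful = {"no", "idk"} if penalize_ambiguous else {"no"}
    faithful = sum(1 for v in verdicts if v not in unfaithful)
    return faithful / len(verdicts)


def turn_faithfulness(windows: List[List[str]], penalize_ambiguous: bool = False) -> float:
    """Average of the per-window interaction scores."""
    if not windows:
        raise ValueError("at least one window of verdicts is required")
    scores = [interaction_score(w, penalize_ambiguous) for w in windows]
    return sum(scores) / len(scores)


# Two windows: the first has one contradicting and one ambiguous claim.
windows = [["yes", "yes", "no", "idk"], ["yes", "yes"]]
print(turn_faithfulness(windows))                           # 0.875
print(turn_faithfulness(windows, penalize_ambiguous=True))  # 0.75
```

In this sketch the first window scores 3/4 by default and 2/4 when ambiguous claims are penalized, while the second window scores 1.0 either way, which gives the averages shown.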