Citation Faithfulness
The citation faithfulness metric uses LLM-as-a-judge to check whether every [N] citation marker in your RAG pipeline's actual_output points to the passage in retrieval_context that actually supports the specific claim the marker is attached to. deepeval's citation faithfulness metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.
Required Arguments
To use the CitationFaithfulnessMetric, you'll have to provide the following arguments when creating an LLMTestCase:
- input
- actual_output
- retrieval_context
The retrieval_context passages are numbered ([1], [2], ...) before being shown to the judge, and the [N] markers in actual_output refer to those passage numbers.
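The numbering convention can be illustrated with a minimal sketch. The helper names below (number_passages, cited_passage_numbers, out_of_range_citations) are hypothetical and not part of deepeval, and the exact prompt layout the metric uses is not shown on this page. The sketch only assumes 1-based [N] labels that line up with the order of retrieval_context.

```python
import re


def number_passages(passages):
    """Prefix each passage with a 1-based [N] label (hypothetical helper)."""
    return [f"[{i}] {text}" for i, text in enumerate(passages, start=1)]


def cited_passage_numbers(answer):
    """Return the passage numbers used by [N] markers, in order of appearance."""
    return [int(n) for n in re.findall(r"\[(\d+)\]", answer)]


def out_of_range_citations(answer, passages):
    """Return cited numbers that do not correspond to any passage."""
    return [n for n in cited_passage_numbers(answer) if not 1 <= n <= len(passages)]


retrieval_context = [
    "The Eiffel Tower stands 330 metres tall in Paris.",
    "The Eiffel Tower was completed in 1889 for the World Fair.",
]
actual_output = "The Eiffel Tower was completed in 1889 [1]."

print("\n".join(number_passages(retrieval_context)))
print(cited_passage_numbers(actual_output))
print(out_of_range_citations(actual_output, retrieval_context))
```

A check like out_of_range_citations is purely structural: it can catch a marker that points at no passage at all, but deciding whether passage N actually supports the claim is left to the LLM judge.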
Usage
The CitationFaithfulnessMetric() can be used for end-to-end evaluation of text-based test cases:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics.community import CitationFaithfulnessMetric

# Replace this with the actual output from your LLM application.
# The completion-year claim is cited to passage [1], which only covers height.
actual_output = "The Eiffel Tower was completed in 1889 [1]."

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = [
    "The Eiffel Tower stands 330 metres tall in Paris.",
    "The Eiffel Tower was completed in 1889 for the World Fair.",
]

metric = CitationFaithfulnessMetric()
test_case = LLMTestCase(
    input="How tall is the Eiffel Tower and when was it completed?",
    actual_output=actual_output,
    retrieval_context=retrieval_context,
)

# To run metric as a standalone
# metric.measure(test_case)
# print(metric.score, metric.reason)

evaluate(test_cases=[test_case], metrics=[metric])

In the example above, the FaithfulnessMetric would pass the answer, because the completion-year claim is supported by passage [2]. The CitationFaithfulnessMetric fails it, because the [1] citation points to the height passage rather than the passage that supports the claim.
The following optional parameters can be set when creating a CitationFaithfulnessMetric:
- [Optional] threshold: a float representing the minimum passing score, defaulted to 1.0. The score is 1.0 for a faithful answer and 0.0 for an unfaithful one, so the default requires a faithful verdict to pass.
- [Optional] model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to gpt-4.1.
- [Optional] include_reason: a boolean which, when set to True, includes a reason for the evaluation score. Defaulted to True.
- [Optional] strict_mode: a boolean which, when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.
- [Optional] async_mode: a boolean which, when set to True, enables concurrent execution within the measure() method. Defaulted to True.
- [Optional] verbose_mode: a boolean which, when set to True, prints the intermediate steps used to calculate the metric to the console. Defaulted to False.
How Is It Calculated?
The CitationFaithfulnessMetric score comes from a single binary verdict. One LLM judge reads the question, the numbered passages, and the answer, then returns one of:
- faithful (score 1.0): every factual claim is supported by the passages, and every [N] citation marker points to a passage that supports the claim it is attached to.
- unfaithful (score 0.0): at least one claim is unsupported, contradicts a passage, or carries a citation [N] where passage N does not support that claim, even if some other passage would.
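The mapping from verdict to score, and from score to pass or fail, can be sketched as follows. This is an illustration under stated assumptions, not deepeval's implementation: VERDICT_SCORES, score_from_verdict, and passes are hypothetical names, and the pass rule assumes "minimum passing score" means score >= threshold.

```python
# Hypothetical mapping from the judge's verdict label to the metric score.
VERDICT_SCORES = {"faithful": 1.0, "unfaithful": 0.0}


def score_from_verdict(verdict: str) -> float:
    """Convert a verdict label into a score, rejecting unknown labels."""
    key = verdict.strip().lower()
    if key not in VERDICT_SCORES:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return VERDICT_SCORES[key]


def passes(score: float, threshold: float = 1.0) -> bool:
    """Assumed pass rule: the score must reach the threshold."""
    return score >= threshold


for verdict in ("faithful", "unfaithful"):
    score = score_from_verdict(verdict)
    print(verdict, score, passes(score))
```

Because the score only takes the values 0.0 and 1.0, any threshold above 0.0 behaves the same as the default of 1.0 under this rule, while a threshold of 0.0 would let every answer pass.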