Contextual Precision
The contextual precision metric uses LLM-as-a-judge to measure your RAG pipeline's retriever by evaluating whether nodes in your retrieval_context that are relevant to the given input are ranked higher than irrelevant ones. deepeval's contextual precision metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.
Required Arguments
To use the ContextualPrecisionMetric, you'll have to provide the following arguments when creating an LLMTestCase:
- input
- actual_output
- expected_output
- retrieval_context
Read the How Is It Calculated section below to learn how test case parameters are used for metric calculation.
Usage
The ContextualPrecisionMetric() can be used for end-to-end evaluation of text-based and multimodal test cases:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import ContextualPrecisionMetric
# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."
# Replace this with the expected output of your RAG generator
expected_output = "You are eligible for a 30 day full refund at no extra cost."
# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["All customers are eligible for a 30 day full refund at no extra cost."]
metric = ContextualPrecisionMetric(
    threshold=0.7,
    model="gpt-4.1",
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    expected_output=expected_output,
    retrieval_context=retrieval_context
)
# To run metric as a standalone
# metric.measure(test_case)
# print(metric.score, metric.reason)
evaluate(test_cases=[test_case], metrics=[metric])

from deepeval import evaluate
from deepeval.test_case import LLMTestCase, MLLMImage
from deepeval.metrics import ContextualPrecisionMetric
# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = [
    f"The Eiffel Tower {MLLMImage(...)} is a wrought-iron lattice tower built in the late 19th century.",
    f"...",
]
metric = ContextualPrecisionMetric(
    threshold=0.7,
    model="gpt-4.1",
    include_reason=True
)
test_case = LLMTestCase(
    input=f"Tell me about this landmark in France: {MLLMImage(...)}",
    actual_output=f"This appears to be Eiffel Tower, which is a famous landmark in France",
    expected_output=f"The Eiffel Tower is located in Paris, France. {MLLMImage(...)}",
    retrieval_context=retrieval_context
)
# To run metric as a standalone
# metric.measure(test_case)
# print(metric.score, metric.reason)
evaluate(test_cases=[test_case], metrics=[metric])
There are SEVEN optional parameters when creating a ContextualPrecisionMetric:
- [Optional] threshold: a float representing the minimum passing threshold, defaulted to 0.5.
- [Optional] model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to gpt-5.4.
- [Optional] include_reason: a boolean which, when set to True, will include a reason for its evaluation score. Defaulted to True.
- [Optional] strict_mode: a boolean which, when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.
- [Optional] async_mode: a boolean which, when set to True, enables concurrent execution within the measure() method. Defaulted to True.
- [Optional] verbose_mode: a boolean which, when set to True, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to False.
- [Optional] evaluation_template: a class of type ContextualPrecisionTemplate, which allows you to override the default prompts used to compute the ContextualPrecisionMetric score. Defaulted to deepeval's ContextualPrecisionTemplate.
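The interaction between threshold and strict_mode can be sketched in plain Python. This is an illustration of the behavior described in the list above, not deepeval's implementation: `is_successful` is a hypothetical helper, and treating "passing" as `score >= threshold` is an assumption based on the phrase "minimum passing threshold".

```python
def is_successful(score: float, threshold: float = 0.5, strict_mode: bool = False) -> bool:
    """Hypothetical helper mirroring the described threshold/strict_mode semantics."""
    if strict_mode:
        # Strict mode: the score becomes binary (1 only when perfect)
        # and the threshold is overridden to 1.
        score = 1.0 if score == 1.0 else 0.0
        threshold = 1.0
    # Assumption: a score equal to the threshold counts as passing.
    return score >= threshold


print(is_successful(0.8))                    # True
print(is_successful(0.8, threshold=0.9))     # False
print(is_successful(0.8, strict_mode=True))  # False
print(is_successful(1.0, strict_mode=True))  # True
```

Under these assumptions, a score of 0.8 passes the default threshold but fails in strict mode, where only a perfect score passes.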
Within components
You can also run the ContextualPrecisionMetric within nested components for component-level evaluation.
from deepeval.dataset import Golden
from deepeval.tracing import observe, update_current_span
...
@observe(metrics=[metric])
def inner_component():
    # Set test case at runtime
    test_case = LLMTestCase(input="...", actual_output="...")
    update_current_span(test_case=test_case)
    return
@observe
def llm_app(input: str):
    # Component can be anything from an LLM call, retrieval, agent, tool use, etc.
    inner_component()
    return
evaluate(observed_callback=llm_app, goldens=[Golden(input="Hi!")])
As a standalone
You can also run the ContextualPrecisionMetric on a single test case as a standalone, one-off execution.
...
metric.measure(test_case)
print(metric.score, metric.reason)
How Is It Calculated?
The ContextualPrecisionMetric score is calculated according to the following equation:
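Written out in LaTeX, and assuming binary relevance verdicts, the weighted cumulative precision takes the following form. The symbols are labels chosen here for illustration: $n$ is the number of nodes in retrieval_context, $r_k \in \{0, 1\}$ is the relevance verdict for the node at rank $k$, and $R$ is the total number of relevant nodes.

```latex
\text{Contextual Precision}
  = \frac{1}{R} \sum_{k=1}^{n} \left( \frac{\sum_{i=1}^{k} r_i}{k} \times r_k \right),
\qquad R = \sum_{k=1}^{n} r_k
```

Each term is the precision over the first $k$ nodes, counted only at ranks where the node is itself relevant, so a relevant node contributes more the earlier it appears.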
The ContextualPrecisionMetric first uses an LLM to determine for each node in the retrieval_context whether it is relevant to the input based on information in the expected_output, before calculating the weighted cumulative precision as the contextual precision score. The weighted cumulative precision (WCP) is used because it:
- Emphasizes Top Results: WCP places a stronger emphasis on the relevance of top-ranked results. This is important because LLMs tend to give more attention to earlier nodes in the retrieval_context (which may cause downstream hallucination if nodes are ranked incorrectly).
- Rewards Relevant Ordering: WCP can handle varying degrees of relevance (e.g., "highly relevant", "somewhat relevant", "not relevant"). This is in contrast to metrics like precision, which treat all retrieved nodes as equally important.
A higher contextual precision score represents a greater ability of the retrieval system to correctly rank relevant nodes higher in the retrieval_context.
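To make the effect of ranking concrete, here is a minimal sketch of a weighted cumulative precision over binary relevance verdicts. It is not deepeval's code: `weighted_cumulative_precision` is a hypothetical function, the verdict lists are hand-written, and returning 0 when no node is relevant is an assumption.

```python
from typing import List


def weighted_cumulative_precision(relevance: List[bool]) -> float:
    """Score a ranked list of binary relevance verdicts (rank 1 first)."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        # Assumption: with no relevant nodes, the score is 0.
        return 0.0

    relevant_so_far = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
            # Precision over the first `rank` nodes, added only at relevant ranks.
            weighted_sum += relevant_so_far / rank
    return weighted_sum / total_relevant


# Same three nodes, two relevant and one irrelevant, in two different orders.
good_ranking = weighted_cumulative_precision([True, True, False])
bad_ranking = weighted_cumulative_precision([False, True, True])
print(good_ranking, bad_ranking)
```

With the relevant nodes ranked first the score is 1.0. Moving the irrelevant node to the top drops it to roughly 0.58, even though the set of retrieved nodes is identical.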
Customize Your Template
Since deepeval's ContextualPrecisionMetric is evaluated by LLM-as-a-judge, you can likely improve your metric accuracy by overriding deepeval's default prompt templates. This is especially helpful if:
- You're using a custom evaluation LLM, especially for smaller models that have weaker instruction following capabilities.
- You want to customize the examples used in the default ContextualPrecisionTemplate to better align with your expectations.
Here's a quick example of how you can override the verdict generation step of the ContextualPrecisionMetric algorithm:
from typing import List
from deepeval.metrics import ContextualPrecisionMetric
from deepeval.metrics.contextual_precision import ContextualPrecisionTemplate
# Define custom template
class CustomTemplate(ContextualPrecisionTemplate):
    @staticmethod
    def generate_verdicts(
        input: str, expected_output: str, retrieval_context: List[str]
    ):
        return f"""Given the input, expected output, and retrieval context, please generate a list of JSON objects to determine whether each node in the retrieval context was remotely useful in arriving at the expected output.
Example JSON:
{{
    "verdicts": [
        {{
            "verdict": "yes",
            "reason": "..."
        }}
    ]
}}
The number of 'verdicts' SHOULD BE STRICTLY EQUAL to that of the contexts.
**
Input:
{input}
Expected output:
{expected_output}
Retrieval Context:
{retrieval_context}
JSON:
"""
# Inject custom template to metric
metric = ContextualPrecisionMetric(evaluation_template=CustomTemplate)
metric.measure(...)
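When testing a custom template, it can help to check that the judge's response still matches the JSON shape the prompt asks for, with one verdict per retrieved node. Below is a minimal sketch that does not rely on deepeval's internals: `parse_verdicts`, `sample_response`, and `contexts` are hypothetical names, and the sample JSON is hand-written rather than real model output.

```python
import json
from typing import List


def parse_verdicts(raw_json: str, retrieval_context: List[str]) -> List[bool]:
    """Hypothetical helper: turn a judge response into per-node relevance flags."""
    verdicts = json.loads(raw_json)["verdicts"]
    # The prompt requires exactly one verdict per retrieved node.
    if len(verdicts) != len(retrieval_context):
        raise ValueError(
            f"expected {len(retrieval_context)} verdicts, got {len(verdicts)}"
        )
    # Assumption: a case-insensitive "yes" means relevant; anything else does not.
    return [v["verdict"].strip().lower() == "yes" for v in verdicts]


sample_response = (
    '{"verdicts": ['
    '{"verdict": "yes", "reason": "..."}, '
    '{"verdict": "no", "reason": "..."}'
    "]}"
)
contexts = ["node about refunds", "node about shipping"]
flags = parse_verdicts(sample_response, contexts)
print(flags)  # [True, False]
```

A check like this surfaces a template that leads a smaller model to drop or merge verdicts before that shows up as an unexpected score.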