
LAMBADA

LAMBADA (LAnguage Modeling Broadened to Account for Discourse Aspects) evaluates an LLM's ability to comprehend context and understand discourse. This dataset includes 10,000 passages sourced from BooksCorpus, each requiring the LLM to predict the final word of a sentence. To explore the dataset in more detail, check out the original LAMBADA paper.

Arguments

There are TWO optional arguments when using the LAMBADA benchmark:

  • [Optional] n_problems: the number of problems for model evaluation. By default, this is set to 5153 (all problems).
  • [Optional] n_shots: the number of examples for few-shot learning. This is set to 5 by default and cannot exceed 5.
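
To make the effect of n_shots concrete, the sketch below shows one way a few-shot prompt for a final-word prediction task could be assembled. It is a hypothetical illustration, not DeepEval's actual prompt template: build_prompt, MAX_SHOTS, EXAMPLES, and the example passages are all invented here, and allowing n_shots=0 is an assumption. Only the upper bound of 5 comes from the argument description above.

```python
MAX_SHOTS = 5  # mirrors the documented upper bound on n_shots

# Made-up (context, final word) pairs standing in for real LAMBADA examples.
EXAMPLES = [
    ("She opened the umbrella because it had started to", "rain"),
    ("After the long hike, all he wanted was a glass of cold", "water"),
    ("The waiter came over and handed each of us a", "menu"),
    ("He struck a match and lit the", "candle"),
    ("The orchestra fell silent as the conductor raised his", "baton"),
]


def build_prompt(context: str, n_shots: int = 5) -> str:
    """Prepend n_shots solved examples to the passage the model must complete."""
    # Assumption: 0 shots is allowed; the documented limit is only the maximum.
    if not 0 <= n_shots <= MAX_SHOTS:
        raise ValueError(f"n_shots must be between 0 and {MAX_SHOTS}")
    parts = [f"Context: {ctx}\nFinal word: {word}" for ctx, word in EXAMPLES[:n_shots]]
    # The last block is left open so the model supplies the missing word.
    parts.append(f"Context: {context}\nFinal word:")
    return "\n\n".join(parts)


prompt = build_prompt("He turned the key and slowly pushed open the", n_shots=3)
print(prompt)
```

With n_shots=3 the prompt holds three solved examples followed by the unsolved passage. Lowering n_shots shortens the prompt but gives the model fewer demonstrations of the expected answer format.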

Usage

The code below assesses a custom mistral_7b model (click here to learn how to use ANY custom LLM) on 10 problems in LAMBADA using 3-shot prompting.

from deepeval.benchmarks import LAMBADA

# Define benchmark with n_problems and n_shots
benchmark = LAMBADA(
    n_problems=10,
    n_shots=3,
)

# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)

The overall_score for this benchmark ranges from 0 to 1, where 1 signifies perfect performance and 0 indicates no correct answers. Scoring uses exact matching: the score is the number of questions for which the model predicts the exact target word, divided by the total number of questions.
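
The exact-match rule described above can be sketched in plain Python. This is only an illustration of the arithmetic, not DeepEval's internal implementation: the function name exact_match_score, the sample predictions, and the choice to ignore surrounding whitespace are assumptions made for the example.

```python
from typing import Sequence


def exact_match_score(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Return the fraction of predictions that exactly match their target word."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    # Assumption: surrounding whitespace is ignored; the comparison is otherwise exact.
    correct = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return correct / len(targets)


# Hypothetical model outputs and gold final words for three passages.
predictions = ["menu", "dancing", "door"]
targets = ["menu", "dance", "door"]
score = exact_match_score(predictions, targets)
print(score)
```

Here two of the three predictions match, so the score is two thirds. Note that "dancing" earns no partial credit against "dance", which is what makes exact matching a strict metric.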

FAQs

What does the LAMBADA benchmark measure?
LAMBADA evaluates an LLM's ability to comprehend context and understand discourse. It includes 10,000 passages sourced from BooksCorpus, each requiring the model to predict the final word of a sentence. See the benchmarks introduction for how scoring works across benchmarks.
How is LAMBADA scored?
Scoring is based on exact matching: the overall_score is the proportion of questions for which the model predicts the precise correct target word out of the total. It ranges from 0 to 1, where 1 signifies perfect performance.
What is the default n_shots for LAMBADA?
It defaults to 5, which is also the maximum. Pass a smaller value to n_shots to include fewer few-shot examples in the prompt.
How many problems does LAMBADA evaluate by default?
By default n_problems is set to 5153, which evaluates all problems. You can pass a smaller value to n_problems to evaluate fewer problems.
Why is LAMBADA considered a good comprehension benchmark?
The dataset is specifically designed so that humans cannot predict the final word of the last sentence without the preceding context, making it an effective benchmark for evaluating a model's broad comprehension.
How do I run LAMBADA on a custom LLM?
Define the benchmark, then call evaluate() with your model and read overall_score. See benchmarking your LLM to learn how to use any custom LLM.
