LAMBADA
LAMBADA (LAnguage Modeling Broadened to Account for Discourse Aspects) evaluates an LLM's ability to comprehend context and understand discourse. This dataset includes 10,000 passages sourced from BooksCorpus, each requiring the LLM to predict the final word of a sentence. To explore the dataset in more detail, check out the original LAMBADA paper.
Arguments
There are TWO optional arguments when using the LAMBADA benchmark:
- [Optional] n_problems: the number of problems for model evaluation. By default, this is set to 5153 (all problems).
- [Optional] n_shots: the number of examples for few-shot learning. This is set to 5 by default and cannot exceed 5.
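To make the role of n_shots concrete, here is a minimal sketch of how a few-shot prompt for final-word prediction could be assembled. It is not DeepEval's actual prompt template: the build_prompt helper, the "Context:" / "Target word:" layout, and the example passages are all hypothetical and written only for illustration. The sketch also assumes 0 is a valid shot count, which the documentation above does not state.

```python
# Hypothetical illustration only: not DeepEval internals, and the
# example passages below are made up (they are not LAMBADA data).
MAX_SHOTS = 5  # mirrors the documented upper bound on n_shots

EXAMPLES = [
    ("She put the key in the lock and opened the", "door"),
    ("He filled the kettle and waited for the water to", "boil"),
    ("The dog wagged its tail and began to", "bark"),
    ("After the long hike they sat down to", "rest"),
    ("She opened the book and started to", "read"),
]


def build_prompt(passage: str, n_shots: int = 5) -> str:
    """Prepend n_shots solved examples to the passage to be completed."""
    if not 0 <= n_shots <= MAX_SHOTS:
        raise ValueError(f"n_shots must be between 0 and {MAX_SHOTS}")
    blocks = [
        f"Context: {context}\nTarget word: {target}\n"
        for context, target in EXAMPLES[:n_shots]
    ]
    # The final block leaves the target word blank for the model to fill in.
    blocks.append(f"Context: {passage}\nTarget word:")
    return "\n".join(blocks)


print(build_prompt("He poured the coffee into his", n_shots=3))
```

Lowering n_shots shortens the prompt by dropping solved examples, which is the trade-off the argument controls.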
Usage
The code below evaluates a custom mistral_7b model (see benchmarking your LLM to learn how to use ANY custom LLM) on 10 problems in LAMBADA using 3-shot prompting.
from deepeval.benchmarks import LAMBADA

# Define benchmark with n_problems and n_shots
benchmark = LAMBADA(
    n_problems=10,
    n_shots=3,
)

# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)

The overall_score for this benchmark ranges from 0 to 1, where 1 signifies perfect performance and 0 indicates no correct answers. The score is based on exact matching: it is the number of questions for which the model predicts the exact target word, divided by the total number of questions.
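The exact-match scoring described above can be sketched in a few lines. This is an illustrative stand-in, not DeepEval's scorer: the exact_match_score function is hypothetical, and stripping surrounding whitespace before comparison is an assumption (the real benchmark may normalize predictions differently).

```python
from typing import Sequence


def exact_match_score(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Return the fraction of predictions that equal their target word."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    # Assumption: only surrounding whitespace is ignored; case must match.
    correct = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return correct / len(targets)


# Made-up predictions and targets, for illustration only.
score = exact_match_score(["door", "cat", "boil "], ["door", "dog", "boil"])
print(score)
```

Here two of the three predictions match their targets, so the score is 2/3. A near miss such as "cat" for "dog" earns no partial credit, which is what makes the metric strict.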
FAQs
What does the LAMBADA benchmark measure?
LAMBADA measures an LLM's ability to comprehend context and understand discourse by having it predict the final word of each passage.
How is LAMBADA scored?
overall_score is the proportion of questions for which the model predicts the exact target word out of the total. It ranges from 0 to 1, where 1 signifies perfect performance.
What is the default n_shots for LAMBADA?
n_shots defaults to 5 and cannot exceed that. Use the n_shots argument to lower it for few-shot prompting.
How many problems does LAMBADA evaluate by default?
By default, n_problems is set to 5153, which evaluates all problems. You can pass a smaller value to n_problems to evaluate fewer problems.
Why is LAMBADA considered a good comprehension benchmark?
Because each problem asks for the final word of a passage, the model has to track the broader context and discourse rather than rely on the last sentence alone.
How do I run LAMBADA on a custom LLM?
Call evaluate() with your model and read overall_score. See benchmarking your LLM to learn how to use any custom LLM.