Question 1

What does the LAMBADA benchmark measure?

Accepted Answer

LAMBADA evaluates an LLM's ability to comprehend context and follow discourse across a whole passage. It includes about 10,000 passages sourced from BooksCorpus; for each passage, the model must predict the final word of the last sentence. See the benchmarks introduction for how scoring works across benchmarks.

Question 2

How is LAMBADA scored?

Accepted Answer

Scoring uses exact matching: overall_score is the number of problems for which the model predicts exactly the correct target word, divided by the total number of problems evaluated. It ranges from 0 to 1, where 1 means every prediction was correct.
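
To make the arithmetic concrete, here is a minimal sketch of exact-match scoring in plain Python. It is not the benchmark's own implementation: the function name exact_match_score and the sample words are hypothetical, and stripping surrounding whitespace before comparing is an assumption (the real scorer may normalize differently or not at all).

```python
def exact_match_score(predictions, targets):
    """Return the fraction of predictions that equal their target word exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    # Assumption: ignore leading/trailing whitespace, but keep case-sensitivity.
    correct = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return correct / len(targets)


# Hypothetical model outputs and gold target words for four passages.
predictions = ["door", "Sarah", "coffee", "river"]
targets = ["door", "Sarah", "tea", "river"]

overall_score = exact_match_score(predictions, targets)
print(overall_score)  # 0.75 (3 of 4 predictions match)
```

Because the match is exact, a near miss such as a synonym or a different capitalization counts as wrong in this sketch.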

Question 3

What is the default n_shots for LAMBADA?

Accepted Answer

n_shots defaults to 5, which is also the maximum. Pass a smaller value to n_shots to include fewer few-shot examples in the prompt.
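
As an illustration of what n_shots controls, the sketch below builds a few-shot prompt from worked examples and enforces the cap of 5. Everything here is hypothetical: build_prompt, the "Context / Target word" template, and the sample passages are not taken from the benchmark, and allowing n_shots=0 is an assumption.

```python
MAX_SHOTS = 5  # upper bound stated in the answer above


def build_prompt(examples, passage, n_shots=5):
    """Prepend up to n_shots solved examples to the passage being evaluated."""
    if not 0 <= n_shots <= MAX_SHOTS:
        raise ValueError(f"n_shots must be between 0 and {MAX_SHOTS}")
    # Each solved example shows a context followed by its target word.
    parts = [
        f"Context: {context}\nTarget word: {word}"
        for context, word in examples[:n_shots]
    ]
    # The passage under evaluation leaves the target word blank for the model.
    parts.append(f"Context: {passage}\nTarget word:")
    return "\n\n".join(parts)


# Hypothetical solved examples (context, target word).
examples = [
    ("She knocked twice and then opened the", "door"),
    ("He poured the milk into his", "coffee"),
    ("They crossed the bridge over the", "river"),
]

prompt = build_prompt(examples, "The cat curled up on the", n_shots=2)
print(prompt)
```

Lowering n_shots shortens the prompt at the cost of giving the model fewer demonstrations of the task format.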

Question 4

How many problems does LAMBADA evaluate by default?

Accepted Answer

By default n_problems is set to 5153, which evaluates all problems. You can pass a smaller value to n_problems to evaluate fewer problems.

Question 5

Why is LAMBADA considered a good comprehension benchmark?

Accepted Answer

The dataset is designed so that humans can predict the final word when they read the whole passage, but not when they see only the last sentence. A model therefore has to draw on the broader context rather than the local sentence, which makes LAMBADA an effective benchmark for evaluating broad comprehension.

Question 6

How do I run LAMBADA on a custom LLM?

Accepted Answer

Define the benchmark, then call evaluate() with your model and read overall_score. See benchmarking your LLM to learn how to use any custom LLM.
