Question 1

What does the BoolQ benchmark measure?

Accepted Answer

BoolQ is a reading comprehension benchmark of 16K naturally occurring yes/no questions (3.3K in the validation set), each accompanied by a passage. The questions are generated in an unprompted setting. See the benchmarks introduction for how scoring works across benchmarks.

Question 2

How is BoolQ scored?

Accepted Answer

Scoring uses exact matching: the overall_score is the number of questions for which the model outputs exactly the correct answer ('Yes' or 'No'), divided by the total number of questions evaluated. It ranges from 0 to 1, where 1 means every question was answered correctly.
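As a rough illustration of exact-match scoring, here is a minimal sketch in plain Python. The function name exact_match_score and the sample answers are hypothetical, and the strict, case-sensitive comparison is an assumption; the benchmark's own implementation may normalize model output differently.

```python
def exact_match_score(predictions, targets):
    """Return the fraction of predictions that equal their target exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one question is required")
    # Assumption: strict string equality, so 'yes' does not match 'Yes'.
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return correct / len(targets)


# Hypothetical model outputs and gold answers for four questions.
predictions = ["Yes", "No", "yes", "No"]
targets = ["Yes", "No", "Yes", "Yes"]

score = exact_match_score(predictions, targets)
print(score)  # 0.5: the lowercase 'yes' and the wrong 'No' both count as misses
```

Under this strict comparison, an answer in the wrong format costs as much as a wrong answer, which is why output formatting matters for the score.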

Question 3

What is the default n_shots for BoolQ?

Accepted Answer

It defaults to 5, which is also the maximum. Pass a smaller value to the n_shots argument to include fewer few-shot examples in the prompt.

Question 4

How many problems does BoolQ evaluate by default?

Accepted Answer

By default n_problems is set to 3270, which evaluates all problems. You can pass a smaller value to n_problems to evaluate fewer problems.

Question 5

Can using more few-shot prompts improve the BoolQ score?

Accepted Answer

Yes. Raising n_shots (up to the maximum of 5) gives the model more examples of the expected answer format, which makes it more likely to answer with exactly 'Yes' or 'No'. Because scoring is exact-match, this can raise the overall_score.

Question 6

How do I run BoolQ on a custom LLM?

Accepted Answer

Define the benchmark, then call evaluate() with your model and read overall_score. See benchmarking your LLM to learn how to use any custom LLM.
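The steps above can be sketched as a small helper. This is a sketch under assumptions, not the library's documented usage: the import path deepeval.benchmarks and the model= keyword are assumed, the helper name run_boolq is hypothetical, and the lower bound of 0 for n_shots is a guess. The limits of 5 shots and 3270 problems come from the answers above.

```python
def run_boolq(model, n_problems=3270, n_shots=5):
    """Run BoolQ on a custom model wrapper and return its overall_score."""
    # Limits taken from the FAQ: at most 5 shots, at most 3270 problems.
    if not 0 <= n_shots <= 5:  # lower bound of 0 is an assumption
        raise ValueError("n_shots must be between 0 and 5")
    if not 1 <= n_problems <= 3270:
        raise ValueError("n_problems must be between 1 and 3270")

    # Imported lazily so the argument checks work without the package.
    from deepeval.benchmarks import BoolQ  # assumed import path

    benchmark = BoolQ(n_problems=n_problems, n_shots=n_shots)
    benchmark.evaluate(model=model)  # model: your custom LLM wrapper
    return benchmark.overall_score
```

A call such as run_boolq(my_model, n_problems=50, n_shots=3), where my_model is your own wrapper, would evaluate a 50-question subset, which is useful for a quick check before a full run.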
