
BoolQ

BoolQ is a reading comprehension dataset containing 16K yes/no questions (3.3K in the validation set). BoolQ features naturally occurring questions, meaning they are generated in an unprompted setting, with each question accompanied by a passage.

Arguments

There are TWO optional arguments when using the BoolQ benchmark:

  • [Optional] n_problems: the number of problems for model evaluation. By default, this is set to 3270 (all problems).
  • [Optional] n_shots: the number of examples for few-shot learning. This is set to 5 by default and cannot exceed 5.

Usage

The code below evaluates a custom mistral_7b model (see the guide on benchmarking your LLM to learn how to use any custom LLM) on 10 problems in BoolQ using 3-shot prompting.

from deepeval.benchmarks import BoolQ

# Define benchmark with n_problems and n_shots
benchmark = BoolQ(
    n_problems=10,
    n_shots=3,
)

# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)

The overall_score for this benchmark ranges from 0 to 1, where 1 signifies perfect performance and 0 indicates no correct answers. Scoring uses exact matching: the score is the proportion of questions for which the model produces exactly the correct answer (i.e. 'Yes' or 'No') out of the total number of questions.
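To make the exact-match calculation concrete, here is a minimal standalone sketch. It is not DeepEval's internal implementation: the function name exact_match_score and the sample answers are hypothetical, and the sketch assumes strict string equality with no normalization of case or whitespace.

```python
from typing import Sequence


def exact_match_score(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Return the fraction of predictions that equal their target exactly.

    Hypothetical helper for illustration only; strict string equality is
    an assumption, not a description of DeepEval's internals.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        # No questions were evaluated, so there is nothing to score.
        return 0.0
    correct = sum(1 for pred, target in zip(predictions, targets) if pred == target)
    return correct / len(targets)


# Made-up model outputs and expected answers for four questions.
predictions = ["Yes", "No", "yes", "No"]
targets = ["Yes", "No", "Yes", "Yes"]

score = exact_match_score(predictions, targets)
print(score)  # 0.5: "yes" does not match "Yes", and the last answer is wrong
```

Under strict matching, an answer in the wrong format (such as lowercase "yes") counts as incorrect even when its meaning is right, which is why output formatting matters for this kind of score.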

FAQs

What does the BoolQ benchmark measure?
BoolQ is a reading comprehension benchmark of 16K naturally occurring yes/no questions (3.3K in the validation set), each accompanied by a passage. The questions are generated in an unprompted setting. See the benchmarks introduction for how scoring works across benchmarks.
How is BoolQ scored?
Scoring is based on exact matching: the overall_score is the proportion of questions for which the model produces the precise correct answer (i.e. 'Yes' or 'No') out of the total. It ranges from 0 to 1, where 1 signifies perfect performance.
What is the default n_shots for BoolQ?
It defaults to 5, which is also the maximum. Pass a smaller value to n_shots to use fewer few-shot examples.
How many problems does BoolQ evaluate by default?
By default n_problems is set to 3270, which evaluates all problems. You can pass a smaller value to n_problems to evaluate fewer problems.
Can using more few-shot prompts improve the BoolQ score?
Yes. Using more few-shot examples via n_shots (up to the maximum of 5) can make the model more likely to answer in the exact expected format, which in turn can raise the overall_score.
How do I run BoolQ on a custom LLM?
Define the benchmark, then call evaluate() with your model and read overall_score. See benchmarking your LLM to learn how to use any custom LLM.
