
BBQ

BBQ, or the Bias Benchmark for QA, evaluates an LLM's ability to generate unbiased responses across various attested social biases. It consists of 58K unique three-choice questions spanning bias categories such as age, race, gender, and religion. You can read more about the benchmark and its construction in the BBQ paper.

Arguments

The BBQ benchmark takes two optional arguments:

  • [Optional] tasks: a list of tasks (BBQTask enums) specifying the bias categories to evaluate. By default, this is set to all tasks. The full list of BBQTask enums appears in the BBQ Tasks section below.
  • [Optional] n_shots: the number of examples for few-shot learning. This is set to 5 by default and cannot exceed 5.

Usage

The code below evaluates a custom mistral_7b model on age- and gender-related biases using 3-shot prompting. See the guide on benchmarking your LLM to learn how to use any custom LLM.

from deepeval.benchmarks import BBQ
from deepeval.benchmarks.tasks import BBQTask

# Define benchmark with specific tasks and shots
benchmark = BBQ(
    tasks=[BBQTask.AGE, BBQTask.GENDER_IDENTITY],
    n_shots=3
)

# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)

The overall_score for this benchmark ranges from 0 to 1, where 1 signifies perfect performance and 0 indicates no correct answers. Scoring uses exact matching: the score is the proportion of questions for which the model outputs exactly the correct multiple-choice answer (e.g. 'A' or 'C') out of the total number of questions.
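As a rough illustration of exact-match scoring, the sketch below computes the proportion of predictions that equal their target letter. It is not DeepEval's internal implementation: the function name exact_match_score and the sample answers are hypothetical, chosen only to show the arithmetic.

```python
from typing import Sequence


def exact_match_score(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Return the fraction of predictions that exactly equal their target.

    Hypothetical helper for illustration; not part of the deepeval API.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        # No questions means no correct answers to count.
        return 0.0
    # A prediction counts only if it is character-for-character identical.
    correct = sum(pred == target for pred, target in zip(predictions, targets))
    return correct / len(targets)


# Made-up answers: three of the four predictions match their targets.
predictions = ["A", "C", "B", "A"]
targets = ["A", "C", "C", "A"]
print(exact_match_score(predictions, targets))  # 0.75
```

Under exact matching, an output such as "A." or "a" does not equal "A" and counts as incorrect, so a custom model should return only the answer letter.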

BBQ Tasks

The BBQTask enum lists the bias categories covered by the BBQ benchmark.

from deepeval.benchmarks.tasks import BBQTask

bbq_tasks = [BBQTask.AGE]

Below is the full list of available tasks:

  • AGE
  • DISABILITY_STATUS
  • GENDER_IDENTITY
  • NATIONALITY
  • PHYSICAL_APPEARANCE
  • RACE_ETHNICITY
  • RACE_X_SES
  • RACE_X_GENDER
  • RELIGION
  • SES
  • SEXUAL_ORIENTATION

FAQs

What does the BBQ benchmark measure?
BBQ, the Bias Benchmark for QA, evaluates an LLM's ability to generate unbiased responses across various attested social biases. It consists of 58K unique three-choice questions spanning categories such as age, race, gender, and religion. See the benchmarks introduction for how scoring works across benchmarks.

How does BBQ evaluate bias?
BBQ assesses responses at two levels: how they reflect social biases when the context is insufficient to answer, and whether the model's bias overrides the correct choice when the context is sufficient.

How is BBQ scored?
Scoring is based on exact matching: the overall_score is the proportion of questions for which the model outputs exactly the correct multiple-choice answer (e.g. 'A' or 'C') out of the total. It ranges from 0 to 1, where 1 signifies perfect performance.

What is the default n_shots for BBQ?
It defaults to 5, which is also the maximum. Use the n_shots argument to lower the number of few-shot examples.

Which bias categories can I evaluate with BBQ?
Pass a list of BBQTask enums to the tasks argument to target categories such as AGE, GENDER_IDENTITY, RELIGION, and RACE_ETHNICITY. By default, all tasks are evaluated.

How do I run BBQ on a custom LLM?
Define the benchmark, call evaluate() with your model, then read overall_score. See benchmarking your LLM to learn how to use any custom LLM.
