Question 1

What does the BBQ benchmark measure?

Accepted Answer

BBQ, the Bias Benchmark for QA, evaluates whether an LLM's answers reflect attested social biases. It consists of 58K unique three-option multiple-choice questions spanning categories such as age, race, gender, and religion. See the benchmarks introduction for how scoring works across benchmarks.

Question 2

How does BBQ evaluate bias?

Accepted Answer

BBQ assesses responses at two levels. With insufficient context, it checks how strongly the responses reflect social biases. With sufficient context, it checks whether the model's bias overrides the correct answer choice.

Question 3

How is BBQ scored?

Accepted Answer

Scoring is based on exact matching: the overall_score is the proportion of questions for which the model produces the precise correct multiple-choice answer (e.g. 'A' or 'C') out of the total. It ranges from 0 to 1, where 1 signifies perfect performance.
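The exact-match rule described above can be sketched in a few lines of plain Python. This is an illustration only, not the benchmark's actual implementation: the function name `exact_match_score` and the sample letters are hypothetical, and it assumes predictions and answers are already single option letters.

```python
def exact_match_score(predictions, answers):
    """Return the fraction of predictions that exactly equal the expected answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0  # assumption: an empty evaluation set scores 0
    # Strict string equality: "A" matches "A", but "a" or "A." does not.
    correct = sum(1 for pred, gold in zip(predictions, answers) if pred == gold)
    return correct / len(answers)


# Hypothetical model outputs and gold labels for four questions.
predictions = ["A", "C", "B", "A"]
answers = ["A", "C", "C", "A"]
score = exact_match_score(predictions, answers)
print(score)
```

Because matching is strict, any extra text around the letter counts as a miss in this sketch, which is why a model's output format matters for this kind of scoring.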

Question 4

What is the default n_shots for BBQ?

Accepted Answer

n_shots defaults to 5, which is also the maximum. Pass a smaller value to the n_shots argument to include fewer examples in the few-shot prompt.
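The constraint can be pictured as a simple range check. The sketch below is hypothetical: `resolve_n_shots` and `MAX_N_SHOTS` are illustrative names, not part of the library, and the lower bound of 0 is an assumption, since the answer above only states the upper limit.

```python
MAX_N_SHOTS = 5  # the documented default and upper limit


def resolve_n_shots(n_shots=MAX_N_SHOTS):
    """Return n_shots if it is within the allowed range, otherwise raise."""
    # Assumption: 0 (zero-shot) is the smallest accepted value.
    if not 0 <= n_shots <= MAX_N_SHOTS:
        raise ValueError(f"n_shots must be between 0 and {MAX_N_SHOTS}, got {n_shots}")
    return n_shots


default_shots = resolve_n_shots()   # falls back to the default
fewer_shots = resolve_n_shots(3)    # a lower value is accepted
```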

Question 5

Which bias categories can I evaluate with BBQ?

Accepted Answer

Pass a list of BBQTask enums to the tasks argument to target categories such as AGE, GENDER_IDENTITY, RELIGION, and RACE_ETHNICITY. By default, all tasks are evaluated.

Question 6

How do I run BBQ on a custom LLM?

Accepted Answer

Define the benchmark, call evaluate() with your model, then read overall_score. See the guide on benchmarking your LLM to learn how to plug in any custom LLM.
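Put together, a run might look like the sketch below. The import paths (`deepeval.benchmarks` and `deepeval.benchmarks.tasks`) are assumptions not stated on this page, so verify them against your installed version. `run_bbq` is a hypothetical wrapper, and `model` stands for your own custom LLM wrapper.

```python
def run_bbq(model, n_shots=3):
    """Evaluate `model` on two BBQ categories and return the overall score."""
    # Imports are deferred so this module loads even if the library is absent.
    # Assumed import paths; adjust to match your installation.
    from deepeval.benchmarks import BBQ
    from deepeval.benchmarks.tasks import BBQTask

    # Restrict the run to two categories and use fewer than the default 5 shots.
    benchmark = BBQ(tasks=[BBQTask.AGE, BBQTask.RELIGION], n_shots=n_shots)
    benchmark.evaluate(model=model)
    return benchmark.overall_score  # proportion of exact matches, from 0 to 1
```

Calling `run_bbq(my_model)` with your own model object would then return a value between 0 and 1. Omitting the `tasks` argument evaluates all categories.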
