MMLU

MMLU (Massive Multitask Language Understanding) is a benchmark for evaluating LLMs through multiple-choice questions. These questions cover 57 subjects such as math, history, law, and ethics. For more information, visit the MMLU GitHub page.

Arguments

The MMLU benchmark takes two optional arguments:

[Optional] tasks: a list of tasks (MMLUTask enums) specifying which of the 57 subject areas to evaluate the language model on. By default, this is set to all tasks. The full list of MMLUTask values appears in the MMLU Tasks section.
[Optional] n_shots: the number of "shots" to use for few-shot learning. This is set to 5 by default and cannot exceed this number.
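
To make the n_shots argument concrete, here is a minimal sketch of how a few-shot multiple-choice prompt could be assembled. It is not deepeval's implementation: the Example class, the format_example and build_prompt helpers, the prompt layout, and the choice to raise ValueError are all assumptions made for illustration. Only the upper limit of 5 shots comes from the description above.

```python
from dataclasses import dataclass

MAX_SHOTS = 5  # the documented upper limit for n_shots


@dataclass
class Example:
    """Hypothetical container for one multiple-choice question."""
    question: str
    choices: list[str]  # four options, displayed as A-D
    answer: str         # the correct letter, e.g. "A"


def format_example(ex: Example, include_answer: bool) -> str:
    """Render one question; solved shots show the answer letter."""
    lines = [ex.question]
    lines += [f"{letter}. {choice}" for letter, choice in zip("ABCD", ex.choices)]
    lines.append(f"Answer: {ex.answer}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(shots: list[Example], target: Example, n_shots: int = 5) -> str:
    """Prepend up to n_shots solved examples to the unsolved target question."""
    # Assumption: out-of-range values are rejected; the real behavior may differ.
    if not 0 <= n_shots <= MAX_SHOTS:
        raise ValueError(f"n_shots must be between 0 and {MAX_SHOTS}")
    parts = [format_example(ex, include_answer=True) for ex in shots[:n_shots]]
    parts.append(format_example(target, include_answer=False))
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo_shots = [
        Example(f"Sample question {i}?", ["w", "x", "y", "z"], "A") for i in range(5)
    ]
    demo_target = Example("Which planet is closest to the Sun?",
                          ["Mercury", "Venus", "Earth", "Mars"], "A")
    print(build_prompt(demo_shots, demo_target, n_shots=3))
```

Each solved example ends with a bare letter after "Answer:", which is what nudges the model to reply in the same single-letter format.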

Usage

The code below evaluates a custom mistral_7b model (see the benchmarking guide for using any custom LLM) on High School Computer Science and Astronomy using 3-shot learning.

from deepeval.benchmarks import MMLU
from deepeval.benchmarks.mmlu.task import MMLUTask

# Define benchmark with specific tasks and shots
benchmark = MMLU(
    tasks=[MMLUTask.HIGH_SCHOOL_COMPUTER_SCIENCE, MMLUTask.ASTRONOMY],
    n_shots=3
)

# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)

The overall_score for this benchmark ranges from 0 to 1, where 1 signifies perfect performance and 0 indicates no correct answers. The score is based on exact matching: it is the proportion of multiple-choice questions for which the model produces exactly the correct letter answer (e.g. 'A'), out of the total number of questions.

Because exact matching is sensitive to output format, using more few-shot prompts (n_shots) can make the model more likely to answer in the exact expected format, which in turn can raise the overall score.
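
The scoring rule described above can be sketched in a few lines. This is an illustration of exact-match accuracy, not deepeval's code: the exact_match_score function is a hypothetical helper, and it assumes strict string equality with no normalization of the model output.

```python
def exact_match_score(predictions: list[str], answers: list[str]) -> float:
    """Return the fraction of predictions that equal the correct letter exactly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    # Assumption: strict equality, so any extra text counts as a miss.
    correct = sum(pred == gold for pred, gold in zip(predictions, answers))
    return correct / len(answers)


answers = ["A", "C", "B", "D"]
# The third prediction names the right letter but in the wrong format.
predictions = ["A", "C", "The answer is B", "D"]
print(exact_match_score(predictions, answers))  # 0.75
```

The third prediction is counted as wrong even though it names the right option, which is why consistent output formatting matters for this benchmark.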

MMLU Tasks

The MMLUTask enum classifies the diverse range of subject areas covered in the MMLU benchmark.

from deepeval.benchmarks.tasks import MMLUTask

mm_tasks = [MMLUTask.HIGH_SCHOOL_EUROPEAN_HISTORY]

Below is the comprehensive list of all available tasks:

HIGH_SCHOOL_EUROPEAN_HISTORY
BUSINESS_ETHICS
CLINICAL_KNOWLEDGE
MEDICAL_GENETICS
HIGH_SCHOOL_US_HISTORY
HIGH_SCHOOL_PHYSICS
HIGH_SCHOOL_WORLD_HISTORY
VIROLOGY
HIGH_SCHOOL_MICROECONOMICS
ECONOMETRICS
COLLEGE_COMPUTER_SCIENCE
HIGH_SCHOOL_BIOLOGY
ABSTRACT_ALGEBRA
PROFESSIONAL_ACCOUNTING
PHILOSOPHY
PROFESSIONAL_MEDICINE
NUTRITION
GLOBAL_FACTS
MACHINE_LEARNING
SECURITY_STUDIES
PUBLIC_RELATIONS
PROFESSIONAL_PSYCHOLOGY
PREHISTORY
ANATOMY
HUMAN_SEXUALITY
COLLEGE_MEDICINE
HIGH_SCHOOL_GOVERNMENT_AND_POLITICS
COLLEGE_CHEMISTRY
LOGICAL_FALLACIES
HIGH_SCHOOL_GEOGRAPHY
ELEMENTARY_MATHEMATICS
HUMAN_AGING
COLLEGE_MATHEMATICS
HIGH_SCHOOL_PSYCHOLOGY
FORMAL_LOGIC
HIGH_SCHOOL_STATISTICS
INTERNATIONAL_LAW
HIGH_SCHOOL_MATHEMATICS
HIGH_SCHOOL_COMPUTER_SCIENCE
CONCEPTUAL_PHYSICS
MISCELLANEOUS
HIGH_SCHOOL_CHEMISTRY
MARKETING
PROFESSIONAL_LAW
MANAGEMENT
COLLEGE_PHYSICS
JURISPRUDENCE
WORLD_RELIGIONS
SOCIOLOGY
US_FOREIGN_POLICY
HIGH_SCHOOL_MACROECONOMICS
COMPUTER_SECURITY
MORAL_SCENARIOS
MORAL_DISPUTES
ELECTRICAL_ENGINEERING
ASTRONOMY
COLLEGE_BIOLOGY
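
With this many tasks, it can be convenient to select a group by name rather than list each member. The sketch below uses a small stand-in enum, DemoTask, with a few names from the list above so that it runs without deepeval installed. The member values and the select_by_prefix helper are hypothetical. Assuming MMLUTask is a standard Python Enum (not confirmed here), the same function should work when MMLUTask is passed instead.

```python
from enum import Enum


class DemoTask(Enum):
    # Stand-in for MMLUTask: names are taken from the list above,
    # values are placeholders chosen for this example.
    HIGH_SCHOOL_PHYSICS = "high_school_physics"
    COLLEGE_PHYSICS = "college_physics"
    HIGH_SCHOOL_BIOLOGY = "high_school_biology"
    ASTRONOMY = "astronomy"
    COLLEGE_BIOLOGY = "college_biology"


def select_by_prefix(task_enum, prefix: str) -> list:
    """Return all members whose name starts with the given prefix."""
    return [task for task in task_enum if task.name.startswith(prefix)]


high_school_tasks = select_by_prefix(DemoTask, "HIGH_SCHOOL_")
print([task.name for task in high_school_tasks])
```

The resulting list could then be passed as the tasks argument when constructing the benchmark.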

FAQs

What does the MMLU benchmark measure?

MMLU (Massive Multitask Language Understanding) evaluates an LLM through multiple-choice questions spanning 57 subjects such as math, history, law, and ethics. It is useful for identifying subject areas where a model lacks understanding.

Which tasks can I run with MMLU?

You can pass any of the 57 MMLUTask enums via the tasks argument, for example HIGH_SCHOOL_COMPUTER_SCIENCE or ASTRONOMY. By default, deepeval evaluates your LLM on all 57 subject areas.

How is MMLU scored?

The overall_score ranges from 0 to 1 and is based on exact matching: it is the proportion of multiple-choice questions for which the model produces the precise correct letter answer (e.g. 'A'). See the benchmarks introduction for how scoring works across benchmarks.

What is the default n_shots for MMLU?

The n_shots argument defaults to 5 and cannot exceed that number. Using more few-shot prompts can improve the model's robustness in generating answers in the exact correct format and boost the overall score.

Does MMLU support chain-of-thought prompting?

No. The MMLU benchmark exposes only two optional arguments, tasks and n_shots, and does not include an enable_cot option for chain-of-thought prompting.

How do I run MMLU on a custom LLM?

Wrap your model in DeepEvalBaseLLM, create a benchmark with MMLU(tasks=[...], n_shots=3), and call benchmark.evaluate(model=mistral_7b). You can then read benchmark.overall_score. See the benchmarking guide for using any custom LLM.
