ARC

ARC, or the AI2 Reasoning Challenge, is a dataset used to benchmark language models' reasoning abilities. The benchmark consists of nearly 8,000 multiple-choice questions from science exams for grades 3 to 9. The dataset is split into two modes, easy and challenge; the challenge set contains harder questions that require more advanced reasoning.

Arguments

There are three optional arguments when using the ARC benchmark:

  • [Optional] n_problems: the number of problems for model evaluation. By default, this is set to all problems available in the selected benchmark mode.
  • [Optional] n_shots: the number of examples for few-shot learning. This is set to 5 by default and cannot exceed 5.
  • [Optional] mode: an ARCMode enum that selects the evaluation mode. This is set to ARCMode.EASY by default. deepeval currently supports 2 modes: EASY and CHALLENGE.
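
To make the n_shots argument more concrete, here is a minimal sketch of how few-shot examples could be prepended to a multiple-choice question. It is not deepeval's actual prompt template: the Example class, format_example, build_prompt, and the MAX_SHOTS constant are hypothetical names used only for illustration, and the sketch assumes that any value from 0 to 5 is acceptable.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical constant mirroring the documented ceiling on n_shots.
MAX_SHOTS = 5


@dataclass
class Example:
    """A hypothetical multiple-choice item (not a deepeval class)."""
    question: str
    choices: Dict[str, str]  # choice label -> answer text
    answer: str              # label of the correct choice


def format_example(ex: Example, include_answer: bool) -> str:
    """Render one item; solved examples show the answer, the target does not."""
    lines = [f"Question: {ex.question}"]
    lines += [f"{label}. {text}" for label, text in ex.choices.items()]
    lines.append(f"Answer: {ex.answer}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(shots: List[Example], target: Example, n_shots: int = 5) -> str:
    """Prepend the first n_shots solved examples to the unanswered target."""
    if not 0 <= n_shots <= MAX_SHOTS:
        raise ValueError(f"n_shots must be between 0 and {MAX_SHOTS}")
    blocks = [format_example(ex, include_answer=True) for ex in shots[:n_shots]]
    blocks.append(format_example(target, include_answer=False))
    return "\n\n".join(blocks)
```

With n_shots=3, the prompt would contain three solved examples followed by the target question, which ends in an open "Answer:" line for the model to complete.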

Usage

The code below evaluates a custom mistral_7b model (see the benchmarking guide to learn how to use any custom LLM) on 100 problems from ARC in EASY mode.

from deepeval.benchmarks import ARC
from deepeval.benchmarks.modes import ARCMode

# Define benchmark with specific n_problems and n_shots in easy mode
benchmark = ARC(
    n_problems=100,
    n_shots=3,
    mode=ARCMode.EASY
)

# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)

The overall_score ranges from 0 to 1 and is the fraction of accurate predictions. Both modes are scored with an exact match scorer, so a prediction counts as correct only when it matches the expected answer exactly.
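
As a rough illustration of exact match scoring, the sketch below computes the fraction of predictions that equal the expected answer label. The function names are hypothetical, and stripping surrounding whitespace before comparison is an assumption; deepeval's own scorer may normalize answers differently.

```python
from typing import Sequence


def exact_match(prediction: str, target: str) -> int:
    """Return 1 if the prediction equals the target exactly, else 0.

    Assumption: surrounding whitespace is ignored before comparing.
    """
    return int(prediction.strip() == target.strip())


def overall_score(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of exact matches across all problems, in the range 0 to 1."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one problem is required")
    correct = sum(exact_match(p, t) for p, t in zip(predictions, targets))
    return correct / len(targets)


# Three of four predicted labels match the expected labels.
score = overall_score(["A", "B", "C", "D"], ["A", "B", "D", "D"])
print(score)  # 0.75
```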

FAQs

What does the ARC benchmark measure?
ARC (AI2 Reasoning Challenge) benchmarks an LLM's reasoning abilities using nearly 8,000 multiple-choice questions drawn from science exams for grades 3 to 9.
What are the EASY and CHALLENGE modes?
The mode argument takes an ARCMode enum and defaults to ARCMode.EASY. Both EASY and CHALLENGE consist of multiple-choice questions, but CHALLENGE questions are more difficult and require more advanced reasoning.
How is ARC scored?
The overall_score ranges from 0 to 1 and represents the fraction of accurate predictions. Both modes are scored with an exact match scorer, so a prediction counts as correct only when it matches the expected answer exactly. See the benchmarks introduction for how scoring works across benchmarks.
What is the default n_shots for ARC?
The n_shots argument is set to 5 by default and cannot exceed 5. It controls the number of examples used for few-shot learning.
How many problems does ARC evaluate by default?
The n_problems argument defaults to all problems available in each benchmark mode. You can lower it, for example n_problems=100, to run a faster evaluation.
How do I run ARC on a custom LLM?
Wrap your model in DeepEvalBaseLLM, create a benchmark with ARC(n_problems=100, n_shots=3, mode=ARCMode.EASY), and call benchmark.evaluate(model=mistral_7b), where mistral_7b is your wrapped model instance. Then read the result from benchmark.overall_score. See the benchmarking guide for using any custom LLM.
