ARC

ARC, or the AI2 Reasoning Challenge, is a dataset used to benchmark language models' reasoning abilities. The benchmark consists of roughly 8,000 multiple-choice questions from science exams for grades 3 to 9. The dataset includes two modes, easy and challenge; the latter features more difficult questions that require advanced reasoning.

Arguments

There are THREE optional arguments when using the ARC benchmark:

  • [Optional] n_problems: the number of problems for model evaluation. By default, this is set to all problems available in the selected benchmark mode.
  • [Optional] n_shots: the number of examples for few-shot learning. This is set to 5 by default and cannot exceed 5.
  • [Optional] mode: an ARCMode enum that selects the evaluation mode. This is set to ARCMode.EASY by default. deepeval currently supports 2 modes: EASY and CHALLENGE.
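
To make the n_shots limit concrete, here is a minimal sketch of how a few-shot multiple-choice prompt could be assembled. This is not deepeval's actual prompt template: build_prompt, format_question, MAX_SHOTS, and the example data are hypothetical names chosen for illustration, and allowing n_shots=0 is an assumption.

```python
from typing import List, Tuple

MAX_SHOTS = 5  # mirrors the documented upper bound on n_shots

# (question, choices, answer letter); hypothetical examples, not taken from ARC
Example = Tuple[str, List[str], str]


def format_question(question: str, choices: List[str]) -> str:
    """Render a question with lettered choices (A, B, C, ...)."""
    lines = [question]
    for index, choice in enumerate(choices):
        lines.append(f"{chr(ord('A') + index)}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


def build_prompt(
    examples: List[Example],
    question: str,
    choices: List[str],
    n_shots: int = 5,
) -> str:
    """Prepend up to n_shots solved examples to the unanswered target question."""
    if not 0 <= n_shots <= MAX_SHOTS:
        raise ValueError(f"n_shots must be between 0 and {MAX_SHOTS}")
    # Each shot is a formatted question followed by its answer letter.
    shots = [
        f"{format_question(q, c)} {answer}"
        for q, c, answer in examples[:n_shots]
    ]
    return "\n\n".join(shots + [format_question(question, choices)])
```

With n_shots=3, as in the usage example on this page, three solved examples would precede the target question, and the model is expected to complete the final "Answer:" line with a single letter.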

Usage

The code below assesses a custom mistral_7b model (see the guide on using any custom LLM) on 100 problems from ARC in EASY mode, with 3 few-shot examples.

from deepeval.benchmarks import ARC
from deepeval.benchmarks.modes import ARCMode

# Define benchmark with specific n_problems and n_shots in easy mode
benchmark = ARC(
    n_problems=100,
    n_shots=3,
    mode=ARCMode.EASY
)

# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)

The overall_score ranges from 0 to 1 and is the fraction of problems the model answered correctly. In both modes, performance is measured with an exact match scorer: a prediction counts as correct only if it matches the expected answer exactly.
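
As a rough illustration of exact-match scoring, the sketch below computes the fraction of predictions that equal their expected answers. It is not deepeval's internal scorer: exact_match_score is a hypothetical helper, and the whitespace and case normalization is an assumption made for this example.

```python
from typing import Sequence


def exact_match_score(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Return the fraction of predictions that exactly match their targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    # Assumption: ignore surrounding whitespace and letter case before comparing.
    correct = sum(
        pred.strip().upper() == gold.strip().upper()
        for pred, gold in zip(predictions, targets)
    )
    return correct / len(targets)


# Three of four answers match, so the score is 0.75.
print(exact_match_score(["A", "b", "C", "D"], ["A", "B", "D", "D"]))
```

Because every problem contributes either 0 or 1, the score is a plain accuracy with no partial credit for near misses.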
