ARC
ARC, or the AI2 Reasoning Challenge, is a dataset for benchmarking language models' reasoning abilities. The benchmark consists of roughly 8,000 multiple-choice questions drawn from science exams for grades 3 to 9. The dataset includes two modes, easy and challenge; the latter features more difficult questions that require more advanced reasoning.
Arguments
There are THREE optional arguments when using the ARC benchmark:
- [Optional] n_problems: the number of problems for model evaluation. By default, this is set to all problems available in each benchmark mode.
- [Optional] n_shots: the number of examples for few-shot learning. This is set to 5 by default and cannot exceed 5.
- [Optional] mode: an ARCMode enum that selects the evaluation mode. This is set to ARCMode.EASY by default. deepeval currently supports 2 modes: EASY and CHALLENGE.
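To make the n_shots argument concrete, here is a minimal sketch of how few-shot examples might be prepended to a question. It is not deepeval's internal prompt format: the build_few_shot_prompt helper, the MAX_SHOTS constant, the "Answer:" template, and the decision to allow 0 shots are all assumptions made for illustration. Only the upper bound of 5 comes from the description above.

```python
from typing import Dict, List

MAX_SHOTS = 5  # mirrors the documented upper bound on n_shots


def build_few_shot_prompt(
    examples: List[Dict[str, str]], question: str, n_shots: int = 5
) -> str:
    """Prepend the first n_shots solved examples to an unanswered question.

    Hypothetical helper: the prompt template is an assumption, not deepeval's.
    """
    if not 0 <= n_shots <= MAX_SHOTS:
        raise ValueError(f"n_shots must be between 0 and {MAX_SHOTS}")
    # Each solved example shows the question followed by its answer.
    shots = [f"{ex['question']}\nAnswer: {ex['answer']}" for ex in examples[:n_shots]]
    # The target question ends with an open "Answer:" for the model to complete.
    return "\n\n".join(shots + [f"{question}\nAnswer:"])


solved = [{"question": f"Example question {i}", "answer": "A"} for i in range(5)]
prompt = build_few_shot_prompt(solved, "Which gas do plants absorb?", n_shots=3)
print(prompt)
```

With n_shots=3, the prompt contains three solved examples followed by the open question. A larger n_shots gives the model more context at the cost of a longer prompt.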
Usage
The code below evaluates a custom mistral_7b model (any custom LLM can be used in its place) on 100 problems from ARC in EASY mode, using 3 few-shot examples.
from deepeval.benchmarks import ARC
from deepeval.benchmarks.modes import ARCMode
# Define benchmark with specific n_problems and n_shots in easy mode
benchmark = ARC(
    n_problems=100,
    n_shots=3,
    mode=ARCMode.EASY
)
# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)

The overall_score ranges from 0 to 1 and represents the fraction of correct predictions across the evaluated problems. In both modes, performance is measured with an exact match scorer: a prediction counts as correct only if it exactly matches the expected answer.
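As an illustration of exact match scoring, the sketch below computes the fraction of predictions that are identical to their expected answers. It is not deepeval's scorer: exact_match_score is a hypothetical helper, and stripping surrounding whitespace before comparing is an assumption.

```python
from typing import Sequence


def exact_match_score(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Return the fraction of predictions identical to their targets.

    Hypothetical helper; whitespace stripping is an assumption.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    # A prediction earns credit only when it equals the target exactly.
    correct = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return correct / len(targets)


# Three of the four predicted answer letters match the expected ones.
score = exact_match_score(["A", "C", "B", "D"], ["A", "B", "B", "D"])
print(score)  # 0.75
```

Because there is no partial credit, a multiple-choice answer that names the right option with extra text around it would still count as incorrect under this scheme.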