Question 1

What does the ARC benchmark measure?

Accepted Answer

ARC (AI2 Reasoning Challenge) benchmarks an LLM's reasoning abilities using about 8,000 multiple-choice questions (7,787 in the published dataset) drawn from science exams for grades 3 to 9.

Question 2

What are the EASY and CHALLENGE modes?

Accepted Answer

The mode argument takes an ARCMode enum and defaults to ARCMode.EASY. Both EASY and CHALLENGE consist of multiple-choice questions, but CHALLENGE questions are more difficult and require more advanced reasoning.

Question 3

How is ARC scored?

Accepted Answer

The overall_score ranges from 0 to 1 and is the fraction of evaluated problems the model answers correctly. Both modes use an exact match scorer, so a prediction counts only when it matches the expected answer exactly. See the benchmarks introduction for how scoring works across benchmarks.
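To make the scoring rule concrete, here is a minimal sketch of exact-match scoring over multiple-choice labels. The helper names exact_match and overall_score are hypothetical, and stripping whitespace before comparing is an assumption; this is an illustration, not the library's actual scorer.

```python
def exact_match(prediction: str, expected: str) -> int:
    """Return 1 if the prediction equals the expected label, else 0.

    Stripping surrounding whitespace is an assumption made for this sketch.
    """
    return int(prediction.strip() == expected.strip())


def overall_score(predictions: list[str], expected: list[str]) -> float:
    """Fraction of predictions that exactly match the expected answers."""
    if len(predictions) != len(expected):
        raise ValueError("predictions and expected must have the same length")
    if not expected:
        return 0.0  # nothing to score
    correct = sum(exact_match(p, e) for p, e in zip(predictions, expected))
    return correct / len(expected)


# Three of the four answers match, so the score is 3 / 4.
print(overall_score(["A", "C", "B", "D"], ["A", "B", "B", "D"]))  # 0.75
```

Because the match is all-or-nothing, a verbose answer such as "The answer is B" would score 0 under this sketch even though it names the right option.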

Question 4

What is the default n_shots for ARC?

Accepted Answer

The n_shots argument is set to 5 by default and cannot exceed 5. It controls the number of examples used for few-shot learning.
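As a rough illustration of what n_shots controls, the sketch below prepends up to five solved examples to the question being asked. The build_prompt helper, the MAX_SHOTS constant, and the prompt layout are all hypothetical and are not taken from the library; only the limit of 5 comes from the answer above.

```python
MAX_SHOTS = 5  # the answer above states that n_shots cannot exceed 5


def build_prompt(examples: list[tuple[str, str]], question: str, n_shots: int = 5) -> str:
    """Build a few-shot prompt from (question, answer) example pairs.

    The "Question:/Answer:" layout is an assumption made for this sketch.
    """
    if not 0 <= n_shots <= MAX_SHOTS:
        raise ValueError(f"n_shots must be between 0 and {MAX_SHOTS}")
    # Use the first n_shots solved examples, then leave the final answer blank.
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in examples[:n_shots]]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


examples = [
    ("Which gas do plants absorb? (A) Oxygen (B) Carbon dioxide", "B"),
    ("What is frozen water called? (A) Ice (B) Steam", "A"),
]
print(build_prompt(examples, "Which planet is closest to the Sun? (A) Mercury (B) Mars", n_shots=2))
```

Fewer shots shorten the prompt and speed up evaluation, at the cost of giving the model less guidance about the expected answer format.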

Question 5

How many problems does ARC evaluate by default?

Accepted Answer

The n_problems argument defaults to all problems available in each benchmark mode. You can lower it, for example n_problems=100, to run a faster evaluation.

Question 6

How do I run ARC on a custom LLM?

Accepted Answer

Wrap your model in a DeepEvalBaseLLM subclass, create a benchmark with ARC(n_problems=100, n_shots=3, mode=ARCMode.EASY), and call benchmark.evaluate(model=mistral_7b), where mistral_7b is an instance of your wrapped model. Then read the result from benchmark.overall_score. See the benchmarking guide for using any custom LLM.
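The calls named in this answer can be put together as in the sketch below. The run_arc helper is hypothetical, and the import paths (deepeval.benchmarks and deepeval.benchmarks.modes) are assumptions not stated in the answer, so check them against your installed version. The model argument is assumed to be an instance of your own DeepEvalBaseLLM subclass.

```python
def run_arc(model, n_problems=100, n_shots=3):
    """Run ARC in EASY mode on `model` and return its overall score.

    `model` is assumed to be an instance of your DeepEvalBaseLLM subclass.
    """
    # Imports are deferred so this helper can be defined before deepeval is
    # installed; the module paths are assumptions, verify them locally.
    from deepeval.benchmarks import ARC
    from deepeval.benchmarks.modes import ARCMode

    benchmark = ARC(n_problems=n_problems, n_shots=n_shots, mode=ARCMode.EASY)
    benchmark.evaluate(model=model)
    return benchmark.overall_score
```

With a wrapped model in hand, usage would look like score = run_arc(mistral_7b), which returns a value between 0 and 1.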
