Question 1

What does the GSM8K benchmark measure?

Accepted Answer

GSM8K evaluates an LLM's ability to perform multi-step mathematical reasoning. It comprises 1,319 grade school math word problems, each of which takes between 2 and 8 steps of elementary arithmetic to solve.

Question 2

How many problems does GSM8K evaluate by default?

Accepted Answer

The n_problems argument controls how many problems are used and defaults to 1319 (all problems in the benchmark). You can lower it, for example n_problems=10, to run a quicker evaluation.

Question 3

How is GSM8K scored?

Accepted Answer

The overall_score ranges from 0 to 1 and is based on exact matching: it is the proportion of math word problems for which the model outputs exactly the correct final number (e.g. '56'). See the benchmarks introduction for how scoring works across benchmarks.
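To make the exact-match idea concrete, here is a minimal sketch of how such a score could be computed. The function name exact_match_score is hypothetical, and stripping surrounding whitespace before comparing is an assumption; this is not DeepEval's actual scorer.

```python
def exact_match_score(predictions, targets):
    """Return the fraction of predictions that exactly match their target.

    Hypothetical helper for illustration only; whitespace stripping is an
    assumption, not documented DeepEval behaviour.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    # A prediction counts only if it equals the target string exactly.
    correct = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return correct / len(targets)


# Three of the four answers match exactly, so the score is 0.75.
score = exact_match_score(["56", "12", "7", "100"], ["56", "13", "7", "100"])
```

Under this scheme an answer such as '56.0' or 'The answer is 56' would not match '56', which is why the output format matters.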

Question 4

What is the default n_shots for GSM8K?

Accepted Answer

The n_shots argument accepts an integer from 0 to 3 and is set to 3 by default. Using more few-shot examples can make the model more reliable at producing answers in the exact expected format.

Question 5

Does GSM8K support chain-of-thought prompting?

Accepted Answer

Yes. The enable_cot argument is a boolean that determines whether CoT prompting is used, and it is set to True by default, prompting the model to articulate its reasoning before answering.
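As a rough illustration of how n_shots and enable_cot could shape a prompt, the sketch below assembles a few-shot prompt from worked examples. Everything in it (EXAMPLES, build_prompt, the prompt wording) is hypothetical; DeepEval's real examples and template may differ.

```python
# Hypothetical worked examples, not the ones DeepEval ships with.
EXAMPLES = [
    {"question": "Tom has 3 apples and buys 2 more. How many apples does he have?",
     "reasoning": "Tom starts with 3 apples and adds 2, so 3 + 2 = 5.",
     "answer": "5"},
    {"question": "A book costs 4 dollars. How much do 6 books cost?",
     "reasoning": "Each book costs 4 dollars, so 6 * 4 = 24.",
     "answer": "24"},
    {"question": "Sara had 10 candies and ate 4. How many are left?",
     "reasoning": "Sara starts with 10 and removes 4, so 10 - 4 = 6.",
     "answer": "6"},
]


def build_prompt(question, n_shots=3, enable_cot=True):
    """Assemble a few-shot prompt (illustrative template only)."""
    if not 0 <= n_shots <= 3:
        raise ValueError("n_shots must be between 0 and 3")
    parts = []
    for example in EXAMPLES[:n_shots]:
        lines = [f"Question: {example['question']}"]
        if enable_cot:
            # With CoT enabled, each example also shows its reasoning.
            lines.append(f"Reasoning: {example['reasoning']}")
        lines.append(f"Answer: {example['answer']}")
        parts.append("\n".join(lines))
    if enable_cot:
        instruction = "Think step by step, then give the final number."
    else:
        instruction = "Give only the final number."
    parts.append(f"Question: {question}\n{instruction}")
    return "\n\n".join(parts)
```

Calling build_prompt(question, n_shots=0, enable_cot=False) yields only the question and the instruction, while the defaults prepend three worked examples with their reasoning.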

Question 6

How do I run GSM8K on a custom LLM?

Accepted Answer

Wrap your model in DeepEvalBaseLLM, create a benchmark with GSM8K(n_problems=10, n_shots=3, enable_cot=True), and call benchmark.evaluate(model=mistral_7b), where mistral_7b is your wrapped model instance. Once the evaluation finishes, read the result from benchmark.overall_score. See the benchmarking guide for using any custom LLM.
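The steps in this answer can be sketched as a small helper. The function run_gsm8k is hypothetical, and the import path deepeval.benchmarks is an assumption. The GSM8K arguments, evaluate(model=...), and overall_score come from the answer above. The import sits inside the function only so that the sketch loads without deepeval installed.

```python
def run_gsm8k(model, n_problems=10, n_shots=3, enable_cot=True):
    """Run GSM8K on `model` and return the overall score (0 to 1).

    `model` is expected to be an instance of your DeepEvalBaseLLM wrapper.
    """
    # Assumed import path; imported lazily so the helper can be defined
    # even when deepeval is not installed.
    from deepeval.benchmarks import GSM8K

    benchmark = GSM8K(n_problems=n_problems, n_shots=n_shots, enable_cot=enable_cot)
    benchmark.evaluate(model=model)
    return benchmark.overall_score
```

With a wrapped model in hand, run_gsm8k(mistral_7b) would run a quick 10-problem evaluation. Pass n_problems=1319 to cover the full benchmark.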
