GSM8K
The GSM8K benchmark comprises 1,319 grade school math word problems, each crafted by expert human problem writers. These problems involve elementary arithmetic operations (+ − ×÷) and require between 2 to 8 steps to solve. The dataset is designed to evaluate an LLM’s ability to perform multi-step mathematical reasoning. For more information, you can read the original GSM8K paper here.
Arguments
There are THREE optional arguments when using the GSM8K benchmark:
- [Optional]
n_problems: the number of problems for model evaluation. By default, this is set to 1319 (all problems in the benchmark). - [Optional]
n_shots: the number of "shots" to use for few-shot learning. This number ranges strictly from 0-3, and is set to 3 by default. - [Optional]
enable_cot: a boolean that determines if CoT prompting is used for evaluation. This is set toTrueby default.
Usage
The code below assesses a custom mistral_7b model (click here to learn how to use ANY custom LLM) on 10 problems in GSM8K using 3-shot CoT prompting.
from deepeval.benchmarks import GSM8K
# Define benchmark with n_problems and shots
benchmark = GSM8K(
n_problems=10,
n_shots=3,
enable_cot=True
)
# Replace 'mistral_7b' with your own custom model
benchmark.evaluate(model=mistral_7b)
print(benchmark.overall_score)The overall_score for this benchmark ranges from 0 to 1, where 1 signifies perfect performance and 0 indicates no correct answers. The model's score, based on exact matching, is calculated by determining the proportion of math word problems for which the model produces the precise correct answer number (e.g. '56') in relation to the total number of questions.
As a result, utilizing more few-shot prompts (n_shots) can greatly improve the model's robustness in generating answers in the exact correct format and boost the overall score.
FAQs
What does the GSM8K benchmark measure?
GSM8K evaluates an LLM's ability to perform multi-step mathematical reasoning. It comprises 1,319 grade school math word problems involving elementary arithmetic operations that require between 2 and 8 steps to solve.How many problems does GSM8K evaluate by default?
n_problems argument controls how many problems are used and defaults to 1319 (all problems in the benchmark). You can lower it, for example n_problems=10, to run a quicker evaluation.How is GSM8K scored?
overall_score ranges from 0 to 1 and is based on exact matching: it is the proportion of math word problems for which the model produces the precise correct answer number (e.g. '56'). See the benchmarks introduction for how scoring works across benchmarks.What is the default n_shots for GSM8K?
n_shots argument ranges strictly from 0-3 and is set to 3 by default. Using more few-shot prompts can improve the model's robustness in generating answers in the exact correct format.Does GSM8K support chain-of-thought prompting?
enable_cot argument is a boolean that determines whether CoT prompting is used, and it is set to True by default, prompting the model to articulate its reasoning before answering.How do I run GSM8K on a custom LLM?
DeepEvalBaseLLM, create a benchmark with GSM8K(n_problems=10, n_shots=3, enable_cot=True), and call benchmark.evaluate(model=mistral_7b) before reading benchmark.overall_score. See the benchmarking guide for using any custom LLM.