GSM8K (Grade School Math 8K) Benchmark

Context:
- It can (typically) contain about 8,500 grade-school math word problems, split into roughly 7,500 training problems and 1,000 test problems (the publicly released files contain 7,473 and 1,319, respectively), designed to test multi-step problem solving.
- It can (often) involve basic arithmetic operations (addition, subtraction, multiplication, and division), with each problem requiring between 2 and 8 steps to solve.
- ...
- It can serve as a benchmark to assess the mathematical reasoning of LLMs.
- It can support research in improving AI multi-step reasoning, especially for natural language mathematical problems.
- It can be used to evaluate techniques such as Chain-of-Thought Prompting, which have a model generate intermediate steps before giving its final answer.
- It can challenge AI systems: when it was introduced, even the largest transformer models struggled to achieve high accuracy, highlighting limitations of the model architectures of that time.
- ...
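
The statements above about multi-step solutions, Chain-of-Thought Prompting, and accuracy can be made concrete with a minimal Python sketch. It assumes the released GSM8K answer format, in which each calculation step carries a `<<expression=result>>` annotation and the final answer follows a `####` marker. The record `SAMPLE` is an illustrative problem written in that style, not one quoted from the dataset, and the helper names (`final_answer`, `count_steps`, `build_cot_prompt`, `is_correct`) are hypothetical. The scoring also assumes the model has been prompted to end its output with the same `####` marker.

```python
import re
from fractions import Fraction

# Illustrative record in the style of a GSM8K solution (not quoted from the dataset).
SAMPLE = {
    "question": (
        "A baker makes 48 rolls in the morning and half as many in the "
        "afternoon. How many rolls does the baker make in total?"
    ),
    "answer": (
        "In the afternoon the baker makes 48/2 = <<48/2=24>>24 rolls.\n"
        "In total the baker makes 48+24 = <<48+24=72>>72 rolls.\n"
        "#### 72"
    ),
}

# One <<expression=result>> annotation per calculation step.
CALC = re.compile(r"<<([^=<>]+)=([^<>]+)>>")
# Final numeric answer after the "####" marker; thousands separators allowed.
FINAL = re.compile(r"####\s*(-?\d[\d,]*(?:\.\d+)?)")


def final_answer(text):
    """Return the number after '####' as a Fraction, or None if it is absent."""
    match = FINAL.search(text)
    if match is None:
        return None
    return Fraction(match.group(1).replace(",", ""))


def count_steps(answer):
    """Count the annotated calculation steps in a reference solution."""
    return len(CALC.findall(answer))


def strip_annotations(answer):
    """Remove the <<...>> annotations, leaving a plain worked solution."""
    return CALC.sub("", answer)


def build_cot_prompt(exemplars, question):
    """Build a few-shot prompt whose exemplars show intermediate steps."""
    parts = [
        f"Question: {ex['question']}\nAnswer: {strip_annotations(ex['answer'])}"
        for ex in exemplars
    ]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


def is_correct(model_output, reference_answer):
    """Exact match on the final number; a missing answer counts as wrong."""
    predicted = final_answer(model_output)
    return predicted is not None and predicted == final_answer(reference_answer)


def accuracy(model_outputs, reference_answers):
    """Fraction of problems whose final answer matches the reference."""
    pairs = list(zip(model_outputs, reference_answers))
    return sum(is_correct(out, ref) for out, ref in pairs) / len(pairs)
```

Comparing only the final number keeps scoring simple, but it does not check whether the intermediate steps are valid, so a model can be marked correct after flawed reasoning.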
Example(s):
- ...
Counter-Example(s):
- GSM-Symbolic.
- Simple Arithmetic Datasets, which do not require multi-step reasoning or linguistic processing.
See: Chain-of-Thought Prompting, Tree-of-Thought Prompting, Transformer Models, Benchmark Datasets.

References