GSM8K (Grade School Math 8K) Benchmark
A GSM8K (Grade School Math 8K) Benchmark is a mathematical reasoning benchmark with linguistically diverse grade school math word problems.
- Context:
- It can (typically) contain about 8,500 math word problems, divided into roughly 7,500 training problems and 1,000 test problems (the released splits contain 7,473 training and 1,319 test problems), designed to test multi-step problem-solving abilities.
- It can (often) involve basic arithmetic operations such as Addition, Subtraction, Multiplication, and Division, requiring between 2 and 8 steps to solve.
- ...
- It can serve as a benchmark to assess the mathematical reasoning of LLMs.
- It can support research in improving AI multi-step reasoning, especially for natural language mathematical problems.
- It can be used to evaluate techniques like Chain-of-Thought Prompting, which prompt models to generate intermediate steps for better problem-solving.
- It can challenge AI systems: when it was introduced in 2021, even the largest transformer models struggled to achieve high accuracy, highlighting limitations of the model architectures of that time.
- ...
- Example(s):
- ...
- Counter-Example(s):
- GSM-Symbolic, which generates variants of GSM8K problems from symbolic templates.
- Simple Arithmetic Datasets, which do not require multi-step reasoning or linguistic processing.
- See: Chain-of-Thought Prompting, Tree-of-Thought Prompting, Transformer Models, Benchmark Datasets.
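The evaluation use described in the context items above can be sketched in a few lines. GSM8K reference solutions end with a line of the form `#### <number>`, so a common way to score a model is to compare that final number against the number the model produces. The sketch below is illustrative only: the muffin problem is a hypothetical example written in the GSM8K style (not an item from the dataset), the function names are made up for this page, and it assumes the model has been prompted to finish its step-by-step answer with the same `####` marker.

```python
import re
from typing import Optional, Sequence

# Matches the number after a '####' marker, e.g. '#### 26' or '#### 1,200'.
FINAL_ANSWER = re.compile(r"####\s*(-?\d[\d,]*(?:\.\d+)?)")


def build_cot_prompt(question: str) -> str:
    """Hypothetical chain-of-thought prompt asking for steps and a marked answer."""
    return (
        f"Question: {question}\n"
        "Solve this step by step, then give the final number "
        "on its own line after '####'.\n"
        "Answer:"
    )


def extract_final_answer(text: str) -> Optional[float]:
    """Return the number after the last '####' marker, or None if absent."""
    matches = FINAL_ANSWER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions whose final number equals the reference's."""
    if not references:
        return 0.0
    correct = 0
    for predicted, reference in zip(predictions, references):
        answer = extract_final_answer(predicted)
        if answer is not None and answer == extract_final_answer(reference):
            correct += 1
    return correct / len(references)


# Hypothetical two-step problem in the GSM8K style (not taken from the dataset).
question = "A baker makes 12 trays of 8 muffins and sells 70. How many muffins are left?"
reference = (
    "The baker makes 12 * 8 = <<12*8=96>>96 muffins.\n"
    "After selling 70, 96 - 70 = <<96-70=26>>26 muffins are left.\n"
    "#### 26"
)
good_output = "12 trays of 8 is 96 muffins. 96 minus 70 is 26.\n#### 26"
bad_output = "12 plus 8 is 20, and 70 minus 20 is 50.\n#### 50"

print(build_cot_prompt(question))
print(exact_match_accuracy([good_output, bad_output], [reference, reference]))
```

Scoring only the final number keeps the metric simple, but it gives no credit for correct intermediate steps and counts an output without the marker as wrong.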
References
2024
- (Mirzadeh et al., 2024) ⇒ Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. (2024). “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models.” In: arXiv preprint arXiv:2410.05229.
- NOTES
- The paper introduces GSM-Symbolic, a new benchmark for evaluating mathematical reasoning in LLMs, addressing limitations of the GSM8K benchmark.
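As a rough picture of what a template-based benchmark of this kind involves, the sketch below turns one word problem into a template whose name and numbers are resampled, with the answer recomputed for each variant. It is a toy illustration under stated assumptions, not the paper's code: the template text, the value ranges, and the `make_variant` helper are all hypothetical.

```python
import random
from typing import Tuple

# Hypothetical template: the name and the three numbers are the variables.
TEMPLATE = (
    "{name} bakes {trays} trays of {per_tray} muffins and sells {sold}. "
    "How many muffins are left?"
)


def make_variant(rng: random.Random) -> Tuple[str, int]:
    """Sample one question variant and compute its ground-truth answer."""
    name = rng.choice(["Ava", "Ben", "Chen"])
    trays = rng.randint(2, 12)
    per_tray = rng.randint(4, 10)
    # Constraint: cannot sell more muffins than were baked.
    sold = rng.randint(1, trays * per_tray)
    answer = trays * per_tray - sold
    question = TEMPLATE.format(name=name, trays=trays, per_tray=per_tray, sold=sold)
    return question, answer


rng = random.Random(0)  # fixed seed so the variants are reproducible
for _ in range(3):
    question, answer = make_variant(rng)
    print(question, "->", answer)
```

Because every variant shares the same reasoning steps, differences in a model's accuracy across variants point to sensitivity to surface details such as names and numbers rather than to the difficulty of the underlying problem.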