Mathematical Reasoning Benchmark

A Mathematical Reasoning Benchmark is a reasoning benchmark that evaluates the mathematical reasoning capabilities of AI models, such as large language models.

  • Context:
    • It can (often) cover various topics, from basic arithmetic to advanced subjects such as algebra, calculus, and probability.
    • ...
    • It can test model capabilities through programmatically generated datasets (as illustrated in the first sketch below).
    • It can assess the limits of model performance across problem types and difficulty levels.
    • ...
  • Example(s):
    • GSM8K Benchmark, which evaluates the multi-step arithmetic word-problem solving skills of large language models (as illustrated in the second sketch below).
    • MATH Benchmark, which challenges models with advanced mathematics competition problems across a range of topics.
    • AQuA Benchmark, which tests algebraic reasoning with word problems that require setting up and solving equations.
    • BIG-bench, which includes tasks designed to probe mathematical reasoning alongside other areas.
    • ...
  • Counter-Example(s):
  • See: ..., ....
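
The first sketch below is a minimal, illustrative Python example of how a programmatically generated dataset can probe arithmetic reasoning: two-step word problems are produced from a template with random numbers, and a model is scored by exact match on the final answer. The model_answer callable and the problem template are hypothetical and are not drawn from any specific benchmark.

    # Minimal, illustrative sketch: generate simple two-step arithmetic
    # word problems from a template and score a model by exact match.
    # `model_answer` is a hypothetical stand-in for a model-inference call.
    import random

    def generate_problem(rng: random.Random) -> tuple[str, int]:
        """Return a (question, answer) pair for a two-step arithmetic problem."""
        a = rng.randint(10, 20)   # starting count
        b = rng.randint(10, 20)   # amount added
        c = rng.randint(2, 5)     # amount given to each of 2 friends
        question = (f"Sam has {a} apples, buys {b} more, and then gives "
                    f"{c} apples to each of 2 friends. How many apples are left?")
        return question, a + b - 2 * c

    def evaluate(model_answer, n_problems: int = 100, seed: int = 0) -> float:
        """Exact-match accuracy of `model_answer` on freshly generated problems."""
        rng = random.Random(seed)
        correct = sum(
            model_answer(question) == gold
            for question, gold in (generate_problem(rng) for _ in range(n_problems))
        )
        return correct / n_problems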
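
The second sketch below illustrates GSM8K-style scoring, assuming the GSM8K convention that reference solutions place the gold answer after a "####" marker; extracting the last number from the model output is a simplifying assumption, since real evaluation harnesses use more robust answer parsing.

    # Minimal, illustrative sketch: exact-match scoring for GSM8K-style items.
    # Gold answers follow the "####" marker in reference solutions; taking the
    # last number in the model output is a simplifying assumption.
    import re

    def gold_answer(reference_solution: str) -> str:
        """Extract the final answer that follows '####' in a GSM8K solution."""
        return reference_solution.split("####")[-1].strip().replace(",", "")

    def predicted_answer(model_output: str) -> str | None:
        """Take the last number appearing in the model's output as its answer."""
        numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output.replace(",", ""))
        return numbers[-1] if numbers else None

    def exact_match_accuracy(model_outputs, reference_solutions) -> float:
        """Fraction of items whose extracted answers match the gold answers."""
        correct = sum(
            predicted_answer(out) == gold_answer(ref)
            for out, ref in zip(model_outputs, reference_solutions)
        )
        return correct / len(reference_solutions)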

