2023 ARBAdvancedReasoningBenchmarkfo
- (Sawada et al., 2023) ⇒ Tomohiro Sawada, Daniel Paleka, Alexander Havrilla, Pranav Tadepalli, Paula Vidas, Alexander Kranias, John J Nay, Kshitij Gupta, and Aran Komatsuzaki. (2023). “ARB: Advanced Reasoning Benchmark for Large Language Models.” In: arXiv preprint arXiv:2307.13692. doi:10.48550/arXiv.2307.13692
Subject Headings: LLM Reasoning Benchmark.
Notes
Cited By
Quotes
Abstract
Large Language Models (LLMs) have demonstrated remarkable performance on various quantitative reasoning and knowledge benchmarks. However, many of these benchmarks are losing utility as LLMs get increasingly high scores, despite not yet reaching expert performance in these domains. We introduce ARB, a novel benchmark composed of advanced reasoning problems in multiple fields. ARB presents a more challenging test than prior benchmarks, featuring problems in mathematics, physics, biology, chemistry, and law. As a subset of ARB, we introduce a challenging set of math and physics problems which require advanced symbolic reasoning and domain knowledge. We evaluate recent models such as GPT-4 and Claude on ARB and demonstrate that current models score well below 50% on more demanding tasks. In order to improve both automatic and assisted evaluation capabilities, we introduce a rubric-based evaluation approach, allowing GPT-4 to score its own intermediate reasoning steps. Further, we conduct a human evaluation of the symbolic subset of ARB, finding promising agreement between annotators and GPT-4 rubric evaluation scores.
1 Introduction
In recent years, models such as GPT-3 [Brown et al., 2020], GPT-4 [OpenAI, 2023], PaLM [Chowdhery et al., 2022], and Chinchilla [Hoffmann et al., 2022] have shown increasing performance across a wide variety of natural language tasks ranging from translation to reasoning [Bubeck et al., 2023, Laskar et al., 2023]. This rapid progress has been closely tracked and assessed by evaluating LLMs on benchmarks, which test model capabilities on a set of standardized problems. The GLUE benchmark [Wang et al., 2019b] for language understanding was first released in April 2018, but models such as BERT [Devlin et al., 2019] and GPT-2 [Radford et al., 2019] in the following year were already powerful enough to necessitate the “SuperGLUE” benchmark [Wang et al., 2019a]. Since then, the race between language models and benchmarks has increasingly favored the former.
Scaling up, model sizes and datasets alike, has led to rapid improvements on various natural language tasks on benchmarks like BIG-bench [Srivastava et al., 2022] and HELM [Liang et al., 2022]. Neural scaling laws [Kaplan et al., 2020, Caballero et al., 2023, Alabdulmohsin et al., 2022] have been used to predict the behavior of large scale models on various metrics. Nevertheless, LLM performance often increases unpredictably [Wei et al., 2022a], especially on tasks that require reasoning abilities. Predictions of performance on ML benchmarks often underestimate the rate of progress [Steinhardt, 2022]. Since progress has been faster than anticipated, new benchmarks need to be more difficult.
Models such as ChatGPT have shown the ability to pass entry-level examinations in fields such as law [Bommarito II and Katz, 2022], medicine [Kung et al., 2023], economics [Caplan, 2023], and mathematics [Shakarian et al., 2023]. Nevertheless, LLM understanding of many fields is reportedly shallow and unreliable [Shapira et al., 2023]. Expert reasoning in domains with specialized knowledge is essential for automated systems to augment skilled professionals [Noy and Zhang, 2023].
In this paper, we introduce a new benchmark dataset, ARB (Advanced Reasoning Benchmark), designed to evaluate expert reasoning abilities in mathematics, physics, chemistry, biology, and law. To make the benchmark more challenging than previous benchmarks, we extract graduate-level tasks from resources intended for domain professionals. The performance of current models such as GPT-4 on the quantitative parts of ARB is very low using standard prompting methods.
Our dataset offers improvements over existing benchmarks:
- Hundreds of problems requiring expert reasoning in quantitative subjects, where LLMs are known to underperform;
- A large percentage of the problems are short-answer and open response questions, in contrast to the multiple-choice questions that dominated earlier benchmarks.
In addition, we propose an automated rubric-based method allowing self-evaluation of intermediate reasoning steps. While not currently a substitute for human evaluation, rubrics generated by GPT-4 have good coverage, and self-evaluation scores track human grading surprisingly well. We provide the instructions to access the dataset in the supplementary material.
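The rubric idea can be illustrated with a minimal sketch of how per-step judgements might be aggregated into one score. Everything here is an assumption made for illustration: the `RubricItem` type, the point values, the example rubric text, and the `rubric_score` helper are hypothetical and are not taken from the paper. The step where a model such as GPT-4 writes the rubric and judges each item is left out and replaced by a hand-written list of booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """One gradable step of a solution (hypothetical structure)."""
    description: str
    points: int


def rubric_score(rubric, satisfied):
    """Return the fraction of rubric points earned.

    `satisfied[i]` says whether the grader (human or model) judged
    that the solution fulfils `rubric[i]`.
    """
    if len(rubric) != len(satisfied):
        raise ValueError("one judgement is required per rubric item")
    total = sum(item.points for item in rubric)
    if total <= 0:
        raise ValueError("rubric must carry a positive number of points")
    earned = sum(item.points for item, ok in zip(rubric, satisfied) if ok)
    return earned / total


# Made-up rubric for an unspecified physics problem (illustration only).
example_rubric = [
    RubricItem("Sets up the correct integral", 4),
    RubricItem("Evaluates the integral without algebra errors", 4),
    RubricItem("States the final answer with correct units", 2),
]

# Stand-in for the grader's per-item judgements.
judgements = [True, False, True]

score = rubric_score(example_rubric, judgements)
print(f"rubric score: {score:.2f}")
```

Scoring intermediate steps in this way gives partial credit to a solution that reaches a wrong final answer through mostly sound reasoning, which a final-answer check alone would miss.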
2 Related Work
Improving the reasoning capabilities of LLMs has been a subject of recent interest, with a particular focus on advanced prompting techniques [Wei et al., 2022b, Kojima et al., 2023, Wang et al., 2023, Yao et al., 2023, Nye et al., 2021]. Such techniques have seen increasingly successful applications in solving reasoning problems involving commonsense reasoning and mathematics, by promoting active reasoning processes within the LLMs before yielding final answers.
Model architectures such as Minerva [Lewkowycz et al., 2022] have exemplified the enhancement of reasoning capabilities through fine-tuning on extensive datasets covering math and reasoning tasks. This has yielded improved performance across several benchmarks, including MATH [Hendrycks et al., 2021], GSM8K [Cobbe et al., 2021], and MMLU [Hendrycks et al., 2020]. Concurrently, other lines of research [Li et al., 2023, Lightman et al., 2023, Cobbe et al., 2021] have investigated the application of verification techniques to augment and enhance LLM performance.
Most of the aforementioned work has typically evaluated techniques against math benchmarks (e.g., GSM8K [Cobbe et al., 2021], MATH [Hendrycks et al., 2021], SVAMP [Patel et al., 2021], ASDiv [Miao et al., 2020], AQuA [Ling et al., 2017], MAWPS [Koncel-Kedziorski et al., 2016], MultiArith [Roy and Roth, 2016]) and commonsense reasoning tasks (e.g., CSQA [Talmor et al., 2018], StrategyQA [Geva et al., 2021], HotpotQA [Yang et al., 2018]). Recently, several new benchmarks have been introduced for reasoning and planning tasks, such as the GPT-Planning Benchmark [Valmeekam et al., 2023], ALERT Reasoning Benchmark [Yu et al., 2022], JEEBench [Arora et al., 2023], and [Gendron et al., 2023]. Additionally, comprehensive evaluation suites like the Chain-of-Thought Hub [Fu et al., 2023] have been proposed.
Despite their utility, existing benchmarks are limited in difficulty, represent a restricted range of reasoning challenges, and do not necessarily mirror real-world tasks demanding complex reasoning. Moreover, recent advancements such as Minerva [Lewkowycz et al., 2022] have revealed that these benchmarks may not offer sufficient challenge.
The rapid progress in LLM capabilities has led many to explore using LLMs in the LLM evaluation pipeline. Apart from using LLMs to generate evaluation tasks [Zhang et al., 2022, Perez et al., 2022], LLMs have increasingly been used as a proxy for human evaluation [Chiang and Lee, 2023, Liu et al., 2023, Fu et al., 2023, Kocmi and Federmann, 2023]. Useful LLM-based evaluation for alignment has been done using rubrics [Bai et al., 2022]. We explore the efficacy of rubrics for evaluation when applied to highly complex math and physics problems.
3 Benchmark
The key considerations when building a machine learning benchmark are:
- Difficulty. Most tasks have to be out of reach of current models; a benchmark where many models score over 95% is not useful for tracking differential AI development.
- Usefulness. The tested skills should correlate with generally useful human skills.
- Ease of evaluation. It should be straightforward for the model creators to compare the performances of different models. The scores should be interpretable.
- Minimizing data contamination. A consistent issue with popular benchmarks is that the recent LLMs contain some tasks in their training data [OpenAI, 2023]. This leads to overestimation of true model capabilities.
- Connection to general capabilities. If a model is trained on data similar to the benchmark, it is possible it achieves high performance without generalization or “intelligence”, failing to solve novel tasks of similar difficulty [Chollet, 2019]. Conversely, problems should not be pathological or overly adversarial, to avoid the dangers of underclaiming [Bowman, 2021].
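As an illustration of the data contamination concern in the list above, the following sketch measures word n-gram overlap between a benchmark problem and a candidate training document. This is a commonly used heuristic and not a method described in this text, which instead relies on choosing sources that are hard to find online. The function names, the n-gram length of 8, and all example strings are assumptions made up for this example.

```python
def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(problem, corpus_document, n=8):
    """Fraction of the problem's n-grams that also occur in the document."""
    problem_grams = word_ngrams(problem, n)
    if not problem_grams:
        return 0.0  # problem shorter than n words: nothing to compare
    corpus_grams = word_ngrams(corpus_document, n)
    return len(problem_grams & corpus_grams) / len(problem_grams)


# Made-up strings for illustration; none of them come from ARB.
problem = (
    "Find the ground state energy of a particle in a one dimensional "
    "infinite square well of width a"
)
leaked = "Homework 3. " + problem + " Show all work."
unrelated = (
    "The committee met on Tuesday to discuss the annual budget and "
    "the schedule for next year"
)

leaked_overlap = ngram_overlap(problem, leaked)
clean_overlap = ngram_overlap(problem, unrelated)
print(leaked_overlap, clean_overlap)
```

A high overlap only flags verbatim or near-verbatim copies. Paraphrased or translated problems would pass this check, so a low score is weak evidence of a clean benchmark.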
3.1 Formatting
The benchmark consists of three types of questions: multiple choice, short answer, and open response, in descending order of proportion in the dataset.
- Multiple choice questions consist of a question and four to five possible answers, and the correct answer is the one that best answers the question. They were sourced from standardized tests, such as the MCAT and bar exam prep, and make up a large proportion of the dataset due to their ease of grading.
- Short answer questions, on the other hand, ask for final answers in the format of a short phrase or mathematical expression. They were sourced from problem books such as Souza and Silva [2008], Gelca and Andreescu [2017], and physics book series Lim and Qiang [2001], Lim [2007], Lim [1998], Lim et al. [2019], and Lim [1996]. We generally avoided algebraic expressions, because of technical difficulties in the grading process. A given algebraic expression may have several equivalent forms (e.g. nontrivial functional relations for the functions appearing in the final answer), and a grading scheme which accounts for all possible variations across our entire dataset is not feasible. Moreover, physics problems often require answers introducing new notation that is not explicitly mentioned in the problem statement.
- Open response questions are more challenging: they consist of a question and a blank space for the answer. They were sourced from problem books and exams, such as the Harvard PhD comprehensive exams in mathematics [Harvard University, 2021]. Such tasks require manual grading. These questions are aspirational in nature, as current systems (e.g. ChatGPT) cannot produce satisfactory responses, even for the “elementary” problems.
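The grading difficulty noted for short answer questions, that one algebraic answer can be written in several equivalent forms, can be shown with a small sketch. Exact string comparison rejects a correct but differently written answer, while a randomized numeric check accepts it. The helper `numerically_equivalent`, the sampling range, the tolerances, and the example expressions are assumptions for illustration and do not describe the benchmark's actual grading procedure.

```python
import math
import random


def numerically_equivalent(f, g, trials=200, lo=-3.0, hi=3.0,
                           rel_tol=1e-9, abs_tol=1e-9, seed=0):
    """Heuristically test whether f and g agree at random sample points.

    Agreement on every sample suggests equivalence but does not prove it;
    a single disagreement proves the expressions differ.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.uniform(lo, hi)
        if not math.isclose(f(x), g(x), rel_tol=rel_tol, abs_tol=abs_tol):
            return False
    return True


# Two written forms of the same answer: different text, same function.
reference_text = "sin(x)**2"
candidate_text = "1 - cos(x)**2"


def reference(x):
    return math.sin(x) ** 2


def candidate(x):
    return 1.0 - math.cos(x) ** 2


def wrong(x):
    return math.sin(x)


exact_string_match = reference_text == candidate_text
print("string match:", exact_string_match)
print("numeric match:", numerically_equivalent(reference, candidate))
print("wrong answer accepted:", numerically_equivalent(reference, wrong))
```

This check handles a single variable with known names. Answers that introduce new notation, as the text notes for physics problems, cannot be mapped to a callable automatically, which is one reason such questions are hard to grade at scale.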
3.2 Mathematics
This part of the dataset is the most diverse. It includes contest mathematics problems as well as “university mathematics” (i.e. mathematics traditionally taught in universities at the undergraduate and beginning graduate level). The contest problems are sourced from Gelca and Andreescu [2017] and Brayman and Kukush [2018], and the university mathematics problems are sourced from Souza and Silva [2008] and Harvard University [2021]. The dataset does not include high school contest problems because those are already present in other well-known benchmarks [Hendrycks et al., 2021]. The Putnam and Brayman books both contain official solutions, which we also include in the dataset. This can be useful for fully automating the grading process, which we leave to future work. For university mathematics, we pick Souza and Silva [2008] for its large selection of “standard” undergraduate mathematics problems, as well as many problems suitable for the short answer portions. We also select Harvard University [2021] because it covers topics that other collections of exams rarely cover, such as representation theory of finite groups and algebraic topology.
3.3 Physics
The physics problems are structured similarly to the math problems. The main difference is that some physics problems contain figures, and there are more problems with numerical answers. The problems were sourced from the Major American Universities PhD Qualifying Questions and Solutions series [Zhongguo-Kexue-Jishu-Daxue, 1990].
3.4 MCAT
The MCAT test contains multiple choice problems testing biology, psychology, chemistry, physics, and reading comprehension. The MCAT problems are sampled from the third edition of McGraw-Hill Education 3 MCAT Practice Tests [Campbell et al., 2017] and cover both science and reading questions. This book was chosen as very few of these problems appear in standard web-searchable sources, limiting contamination. As in the previous categories, we pick problems which are self-contained. Because some MCAT science questions are accompanied by images, we accompany such questions with corresponding image files.
3.5 Law
Applying law involves the application of logical reasoning, in addition to grasping legal knowledge. This makes assessments of legal skills an especially attractive type of language model benchmark, where we are attempting to assess the reasoning and intelligence of these models. Furthermore, if the models better understand law, they can be more reliable and ultimately more useful in real-world applications, potentially even increasing the efficiency and transparency of governments more broadly.
Most lawyers in the U.S. go to law school, graduate, then study for the Bar Examination, and then must pass the bar before going on to practice law professionally. To evaluate legal understanding of the models, we use an older Bar Examination practice set that, to the best of our knowledge, is not available online in a way that could have led to its inclusion in training data for the language models that we are assessing. The practice bar exam we administer to the various language models covers most major areas of law and therefore it tests legal reasoning and broad U.S. legal knowledge.
References
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2023 ARBAdvancedReasoningBenchmarkfo | Tomohiro Sawada, Daniel Paleka, Alexander Havrilla, Pranav Tadepalli, Paula Vidas, Alexander Kranias, John J Nay, Kshitij Gupta, Aran Komatsuzaki | | | ARB: Advanced Reasoning Benchmark for Large Language Models | | | | 10.48550/arXiv.2307.13692 | | 2023 |