2022 ChainofThoughtPromptingElicitsR
- (Wei, Wang et al., 2022) ⇒ Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. (2022). “Chain of Thought Prompting Elicits Reasoning in Large Language Models.” In: arXiv preprint arXiv:2201.11903. doi:10.48550/arXiv.2201.11903
Subject Headings: Chain-of-Thought Prompting, Few-Shot Learning.
Notes
- The paper introduces Chain-of-Thought (CoT) prompting, which improves the ability of large language models (LLMs) to perform complex reasoning tasks by generating a series of intermediate reasoning steps.
- The paper demonstrates that CoT prompting outperforms standard prompting in arithmetic, commonsense, and symbolic reasoning tasks, showcasing significant gains in performance.
- The paper highlights that CoT prompting is most effective when applied to LLMs with over 100 billion parameters, such as PaLM 540B and GPT-3 175B.
- The paper shows that CoT prompting allows models to decompose multi-step problems, enhancing interpretability and enabling models to tackle more complex tasks.
- The paper reports that CoT prompting led to state-of-the-art performance on benchmarks like GSM8K, particularly improving results for math word problems.
- The paper also finds that CoT prompting provides interpretable reasoning paths, which can help identify where a model's reasoning might have gone wrong, aiding in model debugging.
- The paper reveals that smaller models, with fewer parameters, do not benefit as much from CoT prompting, often generating fluent but illogical reasoning steps.
- The paper conducts ablation studies, comparing CoT prompting with variations like "equation-only" and "reasoning after the answer," finding CoT to be the most effective approach.
- The paper demonstrates the general applicability of CoT prompting beyond math, extending its benefits to commonsense reasoning, where models improved on datasets like CSQA and StrategyQA.
- The paper discusses the emergence of CoT prompting as an ability that scales with model size, where larger models exhibit a stronger capability to generalize reasoning tasks using this method.
- The paper suggests future directions for CoT prompting, including research into automating the generation of CoT exemplars to reduce manual effort, as well as improving reasoning accuracy.
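- Illustrative sketch: the prompting variants compared in the paper's ablations (standard, full chain-of-thought, equation-only, and reasoning-after-the-answer) can be made concrete in a few lines of Python. The exemplar wording and the Q:/A: formatting below are assumptions of this note, not the paper's exact prompts.

```python
# Illustrative prompt variants for a single exemplar; wording is ours, not the paper's.
question = ("Jane has 12 flowers. She gives 2 to her mom and 3 to her dad. "
            "How many flowers does she have left?")

# Standard prompting: the exemplar shows only the final answer.
standard = f"Q: {question}\nA: The answer is 7."

# Chain-of-thought prompting: intermediate reasoning precedes the final answer.
chain_of_thought = (
    f"Q: {question}\n"
    "A: Jane starts with 12 flowers. After giving 2 to her mom she has 12 - 2 = 10. "
    "After giving 3 to her dad she has 10 - 3 = 7. The answer is 7."
)

# Ablation variant "equation-only": only the equations, no natural-language steps.
equation_only = f"Q: {question}\nA: 12 - 2 = 10. 10 - 3 = 7. The answer is 7."

# Ablation variant "reasoning after the answer": the answer comes first, so the
# reasoning steps cannot shape how the answer is produced.
reasoning_after_answer = (
    f"Q: {question}\n"
    "A: The answer is 7. Jane starts with 12, gives away 2 and then 3, leaving 7."
)
```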
Cited By
2023
- (Besta et al., 2023) ⇒ Maciej Besta, Nils Blach, ..., Piotr Nyczyk, and Torsten Hoefler. (2023). “Graph of Thoughts: Solving Elaborate Problems with Large Language Models.” In: arXiv preprint arXiv:2308.09687. doi:10.48550/arXiv.2308.09687
- QUOTE: ... in Chain-of-Thought (CoT), one introduces intermediate thoughts a1, a2, ... between x and y. This strategy was shown to significantly enhance various LM tasks over the plain IO baseline, such as mathematical puzzles [70] or general mathematical reasoning [24]. ...
Quotes
Abstract
We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
1 Introduction
- Figure 2: PaLM 540B uses chain-of-thought prompting to achieve new state-of-the-art performance on the GSM8K benchmark of math word problems. The chart reports solve rate (%) for finetuned GPT-3 175B, the prior best, PaLM 540B with standard prompting, and PaLM 540B with chain-of-thought prompting; finetuned GPT-3 and prior best are from Cobbe et al. (2021).
The NLP landscape has recently been revolutionized by language models (Peters et al., 2018; Devlin et al., 2019; Brown et al., 2020, inter alia). Scaling up the size of language models has been shown to confer a range of benefits, such as improved performance and sample efficiency (Kaplan et al., 2020; Brown et al., 2020, inter alia). However, scaling up model size alone has not proved sufficient for achieving high performance on challenging tasks such as arithmetic, commonsense, and symbolic reasoning (Rae et al., 2021).
This work explores how the reasoning ability of large language models can be unlocked by a simple method motivated by two ideas. First, techniques for arithmetic reasoning can benefit from generating natural language rationales that lead to the final answer. Prior work has given models the ability to generate natural language intermediate steps by training from scratch (Ling et al., 2017) or finetuning a pretrained model (Cobbe et al., 2021), in addition to neuro-symbolic methods that use formal languages instead of natural language (Roy and Roth, 2015; Chiang and Chen, 2019; Amini et al., 2019; Chen et al., 2019). Second, large language models offer the exciting prospect of in-context few-shot learning via prompting. That is, instead of finetuning a separate language model checkpoint for each new task, one can simply “prompt” the model with a few input–output exemplars demonstrating the task. Remarkably, this has been successful for a range of simple question-answering tasks (Brown et al., 2020).
- Figure 1: Chain-of-thought prompting enables large language models to tackle complex arithmetic, commonsense, and symbolic reasoning tasks. Chain-of-thought reasoning processes are highlighted.
Both of the above ideas, however, have key limitations. For rationale-augmented training and finetuning methods, it is costly to create a large set of high quality rationales, which is much more complicated than simple input–output pairs used in normal machine learning. For the traditional few-shot prompting method used in Brown et al. (2020), it works poorly on tasks that require reasoning abilities, and often does not improve substantially with increasing language model scale (Rae et al., 2021). In this paper, we combine the strengths of these two ideas in a way that avoids their limitations. Specifically, we explore the ability of language models to perform few-shot prompting for reasoning tasks, given a prompt that consists of triples: ⟨input, chain of thought, output⟩. A chain of thought is a series of intermediate natural language reasoning steps that lead to the final output, and we refer to this approach as chain-of-thought prompting. An example prompt is shown in Figure 1.
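A minimal sketch (not from the paper) of how such a prompt can be assembled from ⟨input, chain of thought, output⟩ triples follows. The Q:/A: markers, the "The answer is ..." suffix, and the baker query are illustrative assumptions; the exemplar paraphrases the tennis-ball problem shown in Figure 1.

```python
def build_cot_prompt(triples, question):
    """Assemble a few-shot chain-of-thought prompt from (input, chain of thought, output) triples."""
    blocks = [f"Q: {x}\nA: {chain} The answer is {y}." for x, chain, y in triples]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

exemplars = [
    (
        "Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
        "How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11.",
        "11",
    ),
]

# The resulting string would be sent to the language model as-is; the model is
# then expected to continue with its own chain of thought for the new question.
print(build_cot_prompt(exemplars, "A baker makes 24 rolls and sells 9. How many rolls are left?"))
```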
We present empirical evaluations on arithmetic, commonsense, and symbolic reasoning benchmarks, showing that chain-of-thought prompting outperforms standard prompting, sometimes to a striking degree. Figure 2 illustrates one such result—on the GSM8K benchmark of math word problems (Cobbe et al., 2021), chain-of-thought prompting with PaLM 540B outperforms standard prompting by a large margin and achieves new state-of-the-art performance. A prompting only approach is important because it does not require a large training dataset and because a single model checkpoint can perform many tasks without loss of generality. This work underscores how large language models can learn via a few examples with natural language data about the task (c.f. automatically learning the patterns underlying inputs and outputs via a large training dataset).
2 Chain-of-Thought Prompting
Consider one’s own thought process when solving a complicated reasoning task such as a multi-step math word problem. It is typical to decompose the problem into intermediate steps and solve each before giving the final answer: “After Jane gives 2 flowers to her mom she has 10 ... then after she gives 3 to her dad she will have 7 ... so the answer is 7.” The goal of this paper is to endow language models with the ability to generate a similar chain of thought—a coherent series of intermediate reasoning steps that lead to the final answer for a problem. We will show that sufficiently large language models can generate chains of thought if demonstrations of chain-of-thought reasoning are provided in the exemplars for few-shot prompting.
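Since each chain of thought ends with a phrase like "so the answer is 7", evaluating such generations requires pulling the final answer out of the generated text. The paper does not prescribe an extraction routine; the regex-based helper below is one possible sketch of that step.

```python
import re

def extract_final_answer(generation: str):
    """Pull the last 'the answer is N' value out of a generated chain of thought.

    A hypothetical helper: the paper reports final-answer accuracy but does not
    specify this exact parsing rule.
    """
    matches = re.findall(r"answer is\s*(-?\d+(?:\.\d+)?)", generation, flags=re.IGNORECASE)
    return matches[-1] if matches else None

chain = ("After Jane gives 2 flowers to her mom she has 10 ... then after she gives "
         "3 to her dad she will have 7 ... so the answer is 7.")
print(extract_final_answer(chain))  # -> '7'
```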
Figure 1 shows an example of a model producing a chain of thought to solve a math word problem that it would have otherwise gotten incorrect. The chain of thought in this case resembles a solution and can be interpreted as one, but we still opt to call it a chain of thought to better capture the idea that it mimics a step-by-step thought process for arriving at the answer (and also, solutions/explanations typically come after the final answer (Narang et al., 2020; Wiegreffe et al., 2022; Lampinen et al., 2022, inter alia)).
Chain-of-thought prompting has several attractive properties as an approach for facilitating reasoning in language models.
1. First, chain of thought, in principle, allows models to decompose multi-step problems into intermediate steps, which means that additional computation can be allocated to problems that require more reasoning steps.
2. Second, a chain of thought provides an interpretable window into the behavior of the model, suggesting how it might have arrived at a particular answer and providing opportunities to debug where the reasoning path went wrong (although fully characterizing a model’s computations that support an answer remains an open question).
3. Third, chain-of-thought reasoning can be used for tasks such as math word problems, commonsense reasoning, and symbolic manipulation, and is potentially applicable (at least in principle) to any task that humans can solve via language.
4. Finally, chain-of-thought reasoning can be readily elicited in sufficiently large off-the-shelf language models simply by including examples of chain of thought sequences into the exemplars of few-shot prompting.
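Point 2 above treats the chain of thought as a window for debugging. One way to act on that, sketched below under the assumption that intermediate steps contain simple "a op b = c" statements, is to re-evaluate those statements and flag the step where the reasoning went wrong; this checker is an illustration by this note, not a method from the paper.

```python
import re

def flag_bad_arithmetic(chain_of_thought: str) -> list[str]:
    """Flag 'a op b = c' statements in a chain of thought whose arithmetic is wrong."""
    flagged = []
    for a, op, b, c in re.findall(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)", chain_of_thought):
        a, b, c = int(a), int(b), int(c)
        actual = {"+": a + b, "-": a - b, "*": a * b}[op]
        if actual != c:
            flagged.append(f"{a} {op} {b} = {c} (should be {actual})")
    return flagged

bad_chain = "Jane starts with 12 flowers. 12 - 2 = 10. 10 - 3 = 8. So the answer is 8."
print(flag_bad_arithmetic(bad_chain))  # -> ['10 - 3 = 8 (should be 7)']
```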
In empirical experiments, we will observe the utility of chain-of-thought prompting for arithmetic reasoning (Section 3), commonsense reasoning (Section 4), and symbolic reasoning (Section 5).
3 Arithmetic Reasoning
We begin by considering math word problems of the form in Figure 1, which measure the arithmetic reasoning ability of language models. Though simple for humans, arithmetic reasoning is a task where language models often struggle (Hendrycks et al., 2021; Patel et al., 2021, inter alia). Strikingly, chain-of-thought prompting when used with the 540B parameter language model performs comparably with task-specific finetuned models on several tasks, even achieving new state of the art on the challenging GSM8K benchmark (Cobbe et al., 2021).
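The GSM8K numbers above are solve rates, i.e., the percentage of problems whose extracted final answer matches the reference answer. A small sketch of that metric follows; the toy predictions are made up purely to show the computation, not taken from the paper's results.

```python
def solve_rate(predicted_answers, reference_answers) -> float:
    """Solve rate (%): fraction of problems where the predicted final answer matches the reference."""
    assert len(predicted_answers) == len(reference_answers)
    correct = sum(p is not None and p.strip() == r.strip()
                  for p, r in zip(predicted_answers, reference_answers))
    return 100.0 * correct / len(reference_answers)

# Toy example with made-up predictions, just to show the computation.
print(solve_rate(["7", "11", None, "42"], ["7", "11", "3", "40"]))  # -> 50.0
```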
...
References
| Author | Title | DOI | Year |
|---|---|---|---|
| Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou | Chain of Thought Prompting Elicits Reasoning in Large Language Models | 10.48550/arXiv.2201.11903 | 2022 |