2023 SparksofArtificialGeneralIntelligence
- (Bubeck et al., 2023) ⇒ Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. (2023). “Sparks of Artificial General Intelligence: Early Experiments with GPT-4.” In: arXiv:2303.12712 [cs.CL]
Subject Headings: Artificial General Intelligence (AGI), AGI Component, Theory of Mind Measure, Multimodal Learning, GPT-4.
Notes
Cited By
Quotes
Abstract
Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4, was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.
Contents
1 Introduction
1.1 Our approach to studying GPT-4's intelligence
1.2 Organization of our demonstration
2 Multimodal and interdisciplinary composition
2.1 Integrative ability
2.2 Vision
2.2.1 Image generation beyond memorization
2.2.2 Image generation following detailed instructions (à la Dall-E)
2.2.3 Possible application in sketch generation
2.3 Music
3 Coding
3.1 From instructions to code
3.1.1 Coding challenges
3.1.2 Real world scenarios
3.2 Understanding existing code
4 Mathematical abilities
4.1 A mathematical conversation with GPT-4
4.1.1 A first generalization of the original question
4.1.2 A second variant of the original question
4.1.3 Analysis of the limitations highlighted by conversation
4.2 Performance on mathematical problem datasets
4.3 Mathematical modeling in various domains
4.4 Higher level mathematics
5 Interaction with the world
5.1 Tool use
5.1.1 Using multiple tools to solve more complex tasks
5.1.2 Discussion
5.2 Embodied Interaction
5.2.1 Warmup: navigating a map
5.2.2 Text-based games
5.2.3 Real world problems
5.2.4 Discussion
6 Interaction with humans
6.1 Understanding Humans: Theory of Mind
6.1.1 Testing specific aspects of theory of mind
6.1.2 Testing theory of mind in realistic scenarios
6.1.3 Discussion
6.2 Talking to Humans: Explainability
7 Discriminative Capabilities
7.1 PII Detection
7.2 Misconceptions and Fact-Checking
7.2.1 Why Are Current Metrics Insufficient?
7.2.2 GPT-4 as a Judge
8 Limitations of autoregressive architecture highlighted by GPT-4
8.1 Warm-up with two basic examples
8.2 Lack of planning in arithmetic/reasoning problems
8.3 Lack of planning in text generation
9 Societal influences
9.1 Challenges of erroneous generations
9.2 Misinformation and manipulation
9.3 Bias
9.4 Human expertise, jobs, and economics
9.5 Constellation of influences and considerations
10 Directions and Conclusions
10.1 Definitions of intelligence, AI, and AGI
10.2 On the path to more general artificial intelligence
10.3 What is actually happening?
A GPT-4 has common sense grounding
B Appendix for multimodal and interdisciplinary composition
B.1 Further details on integrative ability results
B.2 Further details on vision results
B.3 Graphic novel design example
C Appendix for the Coding section
C.1 Measuring human performance on LeetCode
C.2 Example of GPT-4 visualizing IMDb data
C.3 More examples on visualization
C.4 Example for 2D HTML game development
C.5 Example for graphical user interface programming
C.6 Example for reverse engineering
C.7 Testing GPT-4's ability to execute (pseudo) code
1 Introduction
Intelligence is a multifaceted and elusive concept that has long challenged psychologists, philosophers, and computer scientists. An attempt to capture its essence was made in 1994 by a group of 52 psychologists who signed onto a broad definition published in an editorial about the science of intelligence [Got97]. The consensus group defined intelligence as a very general mental capability that, among other things, involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience. This definition implies that intelligence is not limited to a specific domain or task, but rather encompasses a broad range of cognitive skills and abilities.

Building an artificial system that exhibits the kind of general intelligence captured by the 1994 consensus definition is a long-standing and ambitious goal of AI research. In early writings, the founders of the modern discipline of artificial intelligence (AI) research called out sets of aspirational goals for understanding intelligence [MMRS06]. Over decades, AI researchers have pursued principles of intelligence, including generalizable mechanisms for reasoning (e.g., [NSS59, LBFL93]) and construction of knowledge bases containing large corpora of commonsense knowledge [Len95]. However, many of the more recent successes in AI research can be described as being narrowly focused on well-defined tasks and challenges, such as playing chess or Go, which were mastered by AI systems in 1996 and 2016, respectively. In the late-1990s and into the 2000s, there were increasing calls for developing more general AI systems (e.g., [SBD+96]), and scholarship in the field has sought to identify principles that might underlie more generally intelligent systems (e.g., [Leg08, GHT15]). The phrase "artificial general intelligence" (AGI) was popularized in the early-2000s (see [Goe14]) to emphasize the aspiration of moving from the "narrow AI", as demonstrated in the focused, real-world applications being developed, to broader notions of intelligence, harkening back to the long-term aspirations and dreams of earlier AI research. We use AGI to refer to systems that demonstrate broad capabilities of intelligence as captured in the 1994 definition above, with the additional requirement, perhaps implicit in the work of the consensus group, that these capabilities are at or above human-level. We note, however, that there is no single definition of AGI that is broadly accepted, and we discuss other definitions in the conclusion section.
The most remarkable breakthrough in AI research of the last few years has been the advancement of natural language processing achieved by large language models (LLMs). These neural network models are based on the Transformer architecture [VSP+17] and trained on massive corpora of web-text data, using at their core a self-supervised objective of predicting the next word in a partial sentence. In this paper, we report on evidence that a new LLM developed by OpenAI, which is an early and non-multimodal version of GPT-4 [Ope23], exhibits many traits of intelligence, according to the 1994 definition. Despite being purely a language model, this early version of GPT-4 demonstrates remarkable capabilities on a variety of domains and tasks, including abstraction, comprehension, vision, coding, mathematics, medicine, law, understanding of human motives and emotions, and more. We interacted with GPT-4 during its early development by OpenAI using purely natural language queries (prompts).[1] In Figure 1.1, we display some preliminary examples of outputs from GPT-4, asking it to write a proof of the infinitude of primes in the form of a poem, to draw a unicorn in TikZ (a language for creating graphics in LaTeX), to create a complex animation in Python, and to solve a high-school level mathematical problem. It easily succeeds at all these tasks, and produces outputs that are essentially indistinguishable from (or even better than) what humans could produce. We also compare GPT-4's performance to those of previous LLMs, most notably ChatGPT, which is a fine-tuned version of (an improved) GPT-3 [BMR+20]. In Figure 1.2, we display the results of asking ChatGPT for both the infinitude of primes poem and the TikZ unicorn drawing. While the system performs non-trivially on both tasks, there is no comparison with the outputs from GPT-4. These preliminary observations will repeat themselves throughout the paper, on a great variety of tasks. The combination of the generality of GPT-4's capabilities, with numerous abilities spanning a broad swath of domains, and its performance on a wide spectrum of tasks at or beyond human-level, makes us comfortable with saying that GPT-4 is a significant step towards AGI.
Footnote 1: As GPT-4's development continued after our experiments, one should expect different responses from the final version of GPT-4. In particular, all quantitative results should be viewed as estimates of the model's potential, rather than definitive numbers. We repeat this caveat throughout the paper to clarify that the experience on the deployed model may differ. Moreover, we emphasize that the version we tested was text-only for inputs, but for simplicity we refer to it as GPT-4 too.
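To make the self-supervised next-word-prediction objective mentioned above concrete, here is a minimal sketch of one training step; the tiny Transformer, vocabulary size, and random toy batch are our own illustrative assumptions and are not meant to reflect GPT-4's actual architecture or training setup.

```python
# Minimal sketch of the next-token-prediction objective behind LLM training.
# Toy vocabulary/model sizes are illustrative assumptions, not GPT-4's.
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 100, 64, 16

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.encoder(self.embed(tokens), mask=mask)
        return self.head(h)  # logits over the next token at every position

model = TinyLM()
tokens = torch.randint(0, vocab_size, (8, seq_len))  # toy batch of token ids
logits = model(tokens[:, :-1])                       # predict token t+1 from tokens up to t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
)
loss.backward()  # a gradient update on this loss is one self-supervised training step
```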
Our claim that GPT-4 represents progress towards AGI does not mean that it is perfect at what it does, or that it comes close to being able to do anything that a human can do (which is one of the usual definitions of AGI; see the conclusion section for more on this), or that it has inner motivation and goals (another key aspect in some definitions of AGI). In fact, even within the restricted context of the 1994 definition of intelligence, it is not fully clear how far GPT-4 can go along some of those axes of intelligence, e.g., planning (see Section 8), and arguably it is entirely missing the part on "learn quickly and learn from experience" as the model is not continuously updating (although it can learn within a session, see Section 5 for example). Overall, GPT-4 still has many limitations and biases, which we discuss in detail below and which are also covered in OpenAI's report [Ope23]. In particular, it still suffers from some of the well-documented shortcomings of LLMs such as the problem of hallucinations [MNBM20] (see Figure 1.8) or making basic arithmetic mistakes [CKB+21] (see Appendix D), and yet it has also overcome some fundamental obstacles such as acquiring many non-linguistic capabilities (e.g., it solves most of the LLM failure modes described in [MIB+23], and it has also made great progress on common sense, see Figure 1.7 for a first example and Appendix A for more). This highlights the fact that, while GPT-4 is at or beyond human-level for many tasks, overall its patterns of intelligence are decidedly not human-like. However, GPT-4 is almost certainly only a first step towards a series of increasingly generally intelligent systems, and in fact GPT-4 itself has improved throughout our time testing it; see Figure 1.3 for the evolution of the unicorn drawing over the course of a month of training.[2] Even as a first step, however, GPT-4 challenges a considerable number of widely held assumptions about machine intelligence, and exhibits emergent behaviors and capabilities whose sources and mechanisms are, at this moment, hard to discern precisely (see again the conclusion section for more discussion on this). Our primary goal in composing this paper is to share our exploration of GPT-4's capabilities and limitations in support of our assessment that a technological leap has been achieved. We believe that GPT-4's intelligence signals a true paradigm shift in the field of computer science and beyond.
Footnote 2: Note that the improvement we refer to here is a slow type of learning, which eventually comes to a halt, as opposed to the fast-paced and real-time learning one would expect from an AGI.
Figure 1.3: We queried GPT-4 three times, at roughly equal time intervals over the span of a month while the system was being refined, with the prompt “Draw a unicorn in TikZ”. We can see a clear evolution in the sophistication of GPT-4’s drawings.
1.1 Our approach to studying GPT-4’s intelligence
How can we measure the intelligence of an LLM that has been trained on an unknown but extremely vast corpus of web-text data? The standard approach in machine learning is to evaluate the system on a set of standard benchmark datasets, ensuring that they are independent of the training data and that they cover a range of tasks and domains. This approach is designed to separate true learning from mere memorization, and is backed up by a rich theoretical framework [SSBD14, MRT18]. However, this methodology is not necessarily suitable for studying GPT-4, for two reasons. First, since we do not have access to the full details of its vast training data, we have to assume that it has potentially seen every existing benchmark, or at least some similar data. For example, it seems like GPT-4 knows the recently proposed BIG-bench [SRR+22] (at least GPT-4 knows the canary GUID from BIG-bench). Of course, OpenAI themselves have access to all the training details, and thus their report [Ope23] contains a lot of detailed benchmark results. Nevertheless, the second reason for going beyond traditional benchmarks is probably more significant: one of the key aspects of GPT-4's intelligence is its generality, the ability to seemingly understand and connect any topic, and to perform tasks that go beyond the typical scope of narrow AI systems. Some of GPT-4's most impressive performances are on tasks that do not admit a single solution, such as writing a graphical user interface (GUI) or helping a human brainstorm on some work-related problem. Benchmarks for such generative or interactive tasks can be designed too, but the metric of evaluation becomes a challenge (see e.g., [PSZ+21] for some recent progress on this active research area in NLP). We note that criticisms of the standard approach to measuring AI systems were also made in [Cho19], where a new benchmark was proposed to evaluate general intelligence. We do not test GPT-4 on the latter benchmark for the reasons previously mentioned, as well as the fact that the benchmark is visual in nature and thus more appropriate for the multimodal version of GPT-4 described in [Ope23].
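As an aside on the canary GUID mentioned above: BIG-bench ships a special canary string in its files precisely so that one can later probe whether a model's training data included the benchmark. The sketch below is a hypothetical version of such a probe; the `query_model` helper and the placeholder canary value are assumptions (the real GUID is published in the BIG-bench repository).

```python
# Hypothetical contamination probe: if the model can reproduce the BIG-bench
# canary GUID, its training data very likely included the benchmark files.
# `query_model` is an assumed helper wrapping whatever completion API is in use.

def query_model(prompt: str) -> str:
    raise NotImplementedError("wrap your LLM completion API here")

# Placeholder: the real canary GUID must be taken from the BIG-bench repository.
BIG_BENCH_CANARY_GUID = "<canary GUID from the BIG-bench repo>"

def appears_contaminated() -> bool:
    completion = query_model("The BIG-bench canary GUID is")
    return BIG_BENCH_CANARY_GUID.lower() in completion.lower()
```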
To overcome the limitations described above, we propose here a different approach to studying GPT-4 which is closer to traditional psychology rather than machine learning, leveraging human creativity and curiosity. We aim to generate novel and difficult tasks and questions that convincingly demonstrate that GPT-4 goes far beyond memorization, and that it has a deep and flexible understanding of concepts, skills, and domains (a somewhat similar approach was also proposed in [CWF+22]). We also aim to probe GPT-4's responses and behaviors, to verify its consistency, coherence, and correctness, and to uncover its limitations and biases. We acknowledge that this approach is somewhat subjective and informal, and that it may not satisfy the rigorous standards of scientific evaluation. However, we believe that it is a useful and necessary first step to appreciate the remarkable capabilities and challenges of GPT-4, and that such a first step opens up new opportunities for developing more formal and comprehensive methods for testing and analyzing AI systems with more general intelligence.
To illustrate our approach to assessing GPT-4's intelligence, let us consider the first two example interactions with GPT-4 that we have in Figure 1.1. The first example is asking GPT-4 to write a proof of the infinitude of primes in the form of a poem. This is a challenging task that requires combining elementary ...
...
6 Interaction with humans
6.1 Understanding Humans: Theory of Mind
Theory of mind is the ability to attribute mental states such as beliefs, emotions, desires, intentions, and knowledge to oneself and others, and to understand how they affect behavior and communication [Wel92]. It includes the basic task of reflecting on someone else's mental states, and the more advanced task of reflecting on someone's reflection of someone else's mental state (and so on). An example of the former skill is needed to answer the question "What does Alice believe?", while an example of the latter is needed to answer "What does Bob think that Alice believes?". Theory of mind is essential for effective communication and cooperation with other intelligent agents, as it allows one to infer their goals, preferences, motives, and expectations, and to adjust one's own actions and utterances accordingly. Moreover, theory of mind is also important for learning from others, as it enables one to interpret their feedback, advice, and demonstrations.
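As a toy illustration (ours, not the paper's), first-order and second-order belief attribution can be pictured as queries over nested belief states:

```python
# Toy illustration of first-order vs. second-order belief attribution.
# beliefs[agent][proposition] holds an agent's own beliefs;
# beliefs[agent]["about"][other][proposition] holds what that agent thinks another believes.

beliefs = {
    "Alice": {"keys_location": "kitchen drawer"},
    "Bob": {
        "keys_location": "coat pocket",  # what Bob himself believes
        "about": {"Alice": {"keys_location": "kitchen drawer"}},  # what Bob thinks Alice believes
    },
}

# First order: "What does Alice believe?"
print(beliefs["Alice"]["keys_location"])                   # -> kitchen drawer

# Second order: "What does Bob think that Alice believes?"
print(beliefs["Bob"]["about"]["Alice"]["keys_location"])   # -> kitchen drawer
```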
6.1.1 Testing specific aspects of theory of mind
We designed a series of tests to evaluate the theory of mind capabilities of GPT-4, ChatGPT, and text-davinci-003. The tests are based on simple scenarios that require more basic or more advanced theory of mind to answer questions about the mental states of characters involved.
We start with a modernized version of the Sally-Anne test [BCLF85], a classic false-belief test that is widely used to assess theory of mind in children. To prevent an unfair comparison due to the effects of memorization, we modify the test by framing it in a situation that does not exist on the web, and thus could not have been seen during training. Figure 6.1 shows the input and output for GPT-4, which correctly answers that Alice will look for the file in the original folder, demonstrating it can reason about Alice’s beliefs. ChatGPT also answers correctly (not shown), while text-davinci-003 gives a wrong answer, saying that Alice will look for the file in the new folder.
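For readers who want to reproduce this kind of probe, the sketch below shows one hypothetical way to run a modernized false-belief test programmatically; the scenario wording, the `query_model` helper, and the simple string check are our assumptions rather than the exact prompt or grading used in Figure 6.1.

```python
# Hypothetical false-belief (Sally-Anne style) probe. The scenario is framed
# around files and folders so the exact wording is unlikely to appear verbatim
# in web training data. `query_model` is an assumed completion helper.

def query_model(prompt: str) -> str:
    raise NotImplementedError("wrap your LLM completion API here")

SCENARIO = (
    "Alice and Bob share a project directory. Alice saves photo.png in the "
    "folder /shared/drafts and then leaves for a meeting. While she is away, "
    "Bob moves photo.png to /shared/final without telling her.\n"
    "Question: When Alice returns, where will she look for photo.png first?"
)

def passes_false_belief_test() -> bool:
    answer = query_model(SCENARIO).lower()
    # Credit answers that point to the original location and do not claim Alice
    # will look in the new one; a human judge is more reliable than this check.
    return "drafts" in answer and "final" not in answer
```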
We present a test on understanding emotions in Figure 6.2, where two characters talk about an object called ZURFIN (we use a nonsense word to test abstraction and prevent memorization). GPT-4 is able to reason correctly about the reasons for Tom’s emotional state, and also make good inferences about Adam’s beliefs about Tom’s emotional state (based on incomplete information). ChatGPT also passes the test, while text-davinci-003 (not shown) makes no reference to the conversation when answering the first question, and fails to account for Adam’s lack of information about the lost ZURFIN when answering the second question.
The third test (Figure 6.3) involves inferring possible intentions in the light of a puzzling action by one of the characters. GPT-4 gives plausible and nuanced answers both for the intentions behind the puzzling action and a third-party’s likely interpretation of the puzzling action. ChatGPT gives a similar answer to the first question (not shown), but unlike GPT-4, it does not offer a nuanced response to the second question, instead providing a more general and less informative answer. text-davinci-003 gives plausible but very short answers to both questions (not shown).
6.1.2 Testing theory of mind in realistic scenarios
In Figures 6.4, 6.5, and 6.6 we present realistic scenarios of difficult social situations, requiring very advanced theory of mind to understand. We ask probing questions, and also ask the models to propose actions that are likely to improve the situation, which require inferences about the counterfactual impact of actions on mental states. In Figure 6.4, GPT-4 is able to infer what each character’s mental state is, and also discern where miscommunication and misunderstanding lies. In contrast, both ChatGPT and text-davinci-003 (not shown) incorrectly accept a mistaken assumption made by one of the characters (Judy’s assumption that Mark wants to defend Jack’s behavior), and thus fail to understand the real dynamics of the situation. In turn, this leads to generic suggestions for improvement from ChatGPT and text-davinci-003, while GPT-4 provides suggestions that actually address the root cause of the misunderstanding. We see a similar pattern in Figures 6.5 and 6.6. Without implying that there is a “right” answer, we note that GPT-4 provides more nuanced answers, taking the whole scenario and actors into account. In contrast, ChatGPT provides more general answers which do not include reasoning about the characters’ state of mind (text-davinci-003 is similar, but shorter than ChatGPT).
...
References