2023 How is ChatGPT's Behavior Changing over Time?
- (Chen, Zaharia and Zou, 2023) ⇒ Lingjiao Chen, Matei Zaharia, and James Zou. (2023). “How is ChatGPT's Behavior Changing over Time?.” In: arXiv preprint arXiv:2307.09009. doi:10.48550/arXiv.2307.09009
Subject Headings: LLM Benchmarking.
Notes
- Significant drifts were observed in the LLMs' performance and behavior over time. For some tasks, the accuracy dropped substantially from March to June (e.g. GPT-4's accuracy on prime number identification fell from 84% to 51%).
- On the other hand, performance improved on some other tasks (e.g. GPT-3.5's accuracy on prime numbers increased from 50% to 76%).
- The chain-of-thought reasoning approach became less effective over time, especially for GPT-4, which contributed to performance drops on math tasks (see the evaluation sketch after this list).
- GPT-4 became less willing to answer subjective questions and sensitive queries in June compared to March.
- GPT-4 performed better on multi-hop reasoning questions in June, while GPT-3.5's performance on them declined.
- Both models made more formatting mistakes in code generation in June than in March (see the executability-check sketch after the Quotes section).
- Overall, the behavior and performance of the "same" LLM can change substantially over a short period.
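The prime-number finding can be made concrete with a small evaluation harness. Below is a minimal sketch, assuming a hypothetical `query_llm` stub in place of a real March/June model endpoint; the prompt follows the "think step by step and then answer [Yes] or [No]" format the paper shows in Figure 2, while the `extract_answer` parser and the sympy ground-truth check are illustrative choices of our own, not the authors' released code.
```python
import re
from sympy import isprime  # ground truth for prime vs. composite

# Chain-of-thought prompt in the style the paper quotes (Figure 2).
COT_PROMPT = (
    'Is {n} a prime number? Think step by step and then answer '
    '"[Yes]" or "[No]".'
)

def query_llm(prompt: str) -> str:
    """Hypothetical stub; replace with a call to a specific model
    snapshot (e.g., the March vs. June versions compared in the paper)."""
    raise NotImplementedError

def extract_answer(response: str) -> bool | None:
    """Pull the bracketed verdict out of a free-form response;
    None means the model ignored the requested answer format."""
    match = re.search(r"\[(Yes|No)\]", response, flags=re.IGNORECASE)
    return None if match is None else match.group(1).lower() == "yes"

def prime_task_accuracy(numbers: list[int]) -> float:
    """Score one model snapshot on prime-vs-composite identification."""
    correct = 0
    for n in numbers:
        predicted = extract_answer(query_llm(COT_PROMPT.format(n=n)))
        correct += predicted == isprime(n)  # a None verdict never matches
    return correct / len(numbers)
```
Running the same harness against two dated snapshots of the same service is what surfaces drift: identical prompts and identical scoring, but different accuracy.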
Cited By
Quotes
Abstract
GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on several diverse tasks: 1) math problems, 2) sensitive/dangerous questions, 3) opinion surveys, 4) multi-hop knowledge-intensive questions, 5) generating code, 6) US Medical License tests, and 7) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was reasonable at identifying prime vs. composite numbers (84% accuracy) but GPT-4 (June 2023) was poor on these same questions (51% accuracy). This is partly explained by a drop in GPT-4's amenity to follow chain-of-thought prompting. Interestingly, GPT-3.5 was much better in June than in March in this task. GPT-4 became less willing to answer sensitive questions and opinion survey questions in June than in March. GPT-4 performed better at multi-hop questions in June than in March, while GPT-3.5's performance dropped on this task. Both GPT-4 and GPT-3.5 had more formatting mistakes in code generation in June than in March. Overall, our findings show that the behavior of the "same" LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLMs.
Introduction
...
- Figure 2: ... GPT-4 followed the chain-of-thought instruction to obtain the right answer in March, but ignored it in June with the wrong answer. ...
- ... This interesting phenomenon indicates that the same prompting approach, even the widely adopted chain-of-thought strategy, could lead to substantially different performances due to LLM drifts. ...
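The code-generation regression the notes mention is largely a formatting issue: the paper reports that the June snapshots increasingly wrapped their output in Markdown code fences, which fails a "directly executable" check even when the code inside is fine. The following is a minimal sketch of such a check; the fence-stripping helper and the `compile`-based test are our own illustration, not the authors' evaluation code.
```python
import re

FENCE = "`" * 3  # three backticks, built indirectly so this example nests cleanly

def strip_markdown_fences(text: str) -> str:
    """Remove a surrounding Markdown code fence (three backticks,
    optionally tagged python), the formatting artifact the paper
    reports becoming more common in June."""
    pattern = FENCE + r"(?:python)?\s*\n(.*?)" + FENCE
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else text

def is_directly_executable(generation: str) -> bool:
    """True if the raw model output parses as Python with no cleanup."""
    try:
        compile(generation, "<llm-output>", "exec")
        return True
    except SyntaxError:
        return False

raw = FENCE + "python\nprint(sum(range(10)))\n" + FENCE
print(is_directly_executable(raw))                        # False: fences break parsing
print(is_directly_executable(strip_markdown_fences(raw))) # True once fences are removed
```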
References
Author | Title | DOI | Year
---|---|---|---
Lingjiao Chen, Matei Zaharia, and James Zou | How is ChatGPT's Behavior Changing over Time? | 10.48550/arXiv.2307.09009 | 2023