2023 GPQAAGraduateLevelGoogleProofQA

From GM-RKB

Subject Headings: GPQA Benchmark

Notes

  • The paper presents GPQA, a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry, written by domain experts to be extremely difficult and "Google-proof."
  • The paper supports its claim of high question quality with validation results: experts who have or are pursuing PhDs in the relevant fields reach 65% accuracy, whereas highly skilled non-experts with unrestricted web access reach only 34%.
  • The paper shows that the questions are also difficult for state-of-the-art AI systems: its strongest GPT-4-based baseline achieves 39% accuracy. It uses this result to argue for scalable oversight methods that let humans supervise AI outputs effectively.
  • The paper details a rigorous data collection and validation process involving multiple stages, including expert and non-expert validation, to ensure the objectivity and difficulty of the questions.
  • The paper describes two curated subsets of the collected questions: the main set (448 questions) and the diamond set (198 questions), each filtered based on expert agreement and non-expert difficulty.
  • The paper demonstrates the dataset's applicability to scalable oversight research by providing a realistic testbed for evaluating human supervision of AI systems, particularly in generating truthful information in specialized domains.
  • The paper acknowledges limitations such as the dataset's small size and potential biases due to the sourcing of experts through Upwork, while highlighting the dataset's potential to inform future AI oversight protocols.
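
The subset filtering mentioned in the notes above (keeping questions by expert agreement and non-expert difficulty) can be sketched roughly as follows. This is only an illustration: the `Question` record, its field names, the helper functions, and both thresholds are hypothetical and are not the criteria used in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Question:
    # Hypothetical record layout; not the dataset's actual schema.
    qid: str
    expert_correct: List[bool]      # one flag per expert validator
    non_expert_correct: List[bool]  # one flag per non-expert validator


def fraction_correct(flags: List[bool]) -> float:
    """Share of validators who answered correctly (assumes a non-empty list)."""
    return sum(flags) / len(flags)


def filter_subset(questions: List[Question],
                  min_expert_agreement: float,
                  max_non_expert_accuracy: float) -> List[Question]:
    """Keep questions that experts mostly get right and non-experts mostly miss.

    Both thresholds are illustrative assumptions, not the paper's rules.
    """
    return [
        q for q in questions
        if fraction_correct(q.expert_correct) >= min_expert_agreement
        and fraction_correct(q.non_expert_correct) <= max_non_expert_accuracy
    ]


# Toy data, invented purely for this example.
questions = [
    Question("q1", [True, True], [False, False, True]),
    Question("q2", [True, False], [False, False, False]),
    Question("q3", [True, True], [True, True, False]),
]

# A stricter filter yields a smaller, higher-confidence subset.
strict = filter_subset(questions, min_expert_agreement=1.0,
                       max_non_expert_accuracy=0.34)
loose = filter_subset(questions, min_expert_agreement=0.5,
                      max_non_expert_accuracy=1.0)
print([q.qid for q in strict], [q.qid for q in loose])
```

Tightening either threshold trades dataset size for confidence that each retained question is both objective and hard, which matches the relationship between the larger main set and the smaller diamond set.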

Quotes

Abstract

We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof"). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions, for example, when developing new scientific knowledge, we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.

References

Page: 2023 GPQAAGraduateLevelGoogleProofQA
Author: Julian Michael, Samuel R. Bowman, Richard Yuanzhe Pang, David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Julien Dirani
Title: GPQA: A Graduate-Level Google-Proof Q&A Benchmark
DOI: 10.48550/arXiv.2311.12022
Year: 2023