2023 GPQAAGraduateLevelGoogleProofQA

Subject Headings: GPQA Benchmark

Notes

The paper presents GPQA, a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry, written by domain experts to be extremely difficult and "Google-proof."
The paper ensures question quality through expert validation: validators who have or are pursuing PhDs in the relevant fields reach 65% accuracy on the questions, whereas highly skilled non-experts with unrestricted web access reach only 34%.
The paper highlights the difficulty of the questions for state-of-the-art AI systems, with the strongest GPT-4-based baseline achieving 39% accuracy, and argues that this motivates scalable oversight methods for supervising AI outputs.
The paper details a rigorous data collection and validation process involving multiple stages, including expert and non-expert validation, to ensure the objectivity and difficulty of the questions.
The paper discusses the creation of two curated subsets of the dataset: the main set (448 questions) and the diamond set (198 questions), filtered based on expert agreement and non-expert difficulty.
The paper argues that the dataset suits scalable oversight research because it provides a realistic testbed for evaluating how humans can supervise AI systems, particularly how experts can reliably obtain truthful information from them in specialized domains.
The paper acknowledges limitations such as the dataset's small size and potential biases due to the sourcing of experts through Upwork, while highlighting the dataset's potential to inform future AI oversight protocols.
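The subset filtering mentioned in the notes (expert agreement combined with non-expert difficulty) can be sketched roughly as follows. This is only an illustration: the record fields, the function names, and both threshold values are assumptions made for this example and are not the paper's actual selection criteria.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QuestionRecord:
    """Hypothetical per-question validation summary (field names are assumptions)."""
    question_id: str
    expert_correct: int      # expert validators who chose the intended answer
    expert_total: int        # expert validators who attempted the question
    non_expert_correct: int  # non-expert validators who chose the intended answer
    non_expert_total: int    # non-expert validators who attempted the question


def passes_filter(q: QuestionRecord,
                  min_expert_agreement: float,
                  max_non_expert_accuracy: float) -> bool:
    """Keep a question if experts mostly agree with the answer key and
    non-experts mostly miss it. Thresholds are illustrative assumptions."""
    if q.expert_total == 0 or q.non_expert_total == 0:
        return False  # cannot judge a question nobody validated
    expert_rate = q.expert_correct / q.expert_total
    non_expert_rate = q.non_expert_correct / q.non_expert_total
    return (expert_rate >= min_expert_agreement
            and non_expert_rate <= max_non_expert_accuracy)


def filter_subset(records: List[QuestionRecord],
                  min_expert_agreement: float,
                  max_non_expert_accuracy: float) -> List[QuestionRecord]:
    """Return the questions that satisfy both conditions, in input order."""
    return [q for q in records
            if passes_filter(q, min_expert_agreement, max_non_expert_accuracy)]


if __name__ == "__main__":
    # Made-up validation outcomes, not data from the paper.
    records = [
        QuestionRecord("A", 2, 2, 0, 3),
        QuestionRecord("B", 1, 2, 2, 3),
        QuestionRecord("C", 0, 2, 0, 3),
        QuestionRecord("D", 2, 2, 3, 3),
    ]
    lenient = filter_subset(records, 0.5, 0.7)  # looser, larger subset
    strict = filter_subset(records, 1.0, 0.4)   # tighter, smaller subset
    print([q.question_id for q in lenient])
    print([q.question_id for q in strict])
```

Tightening both thresholds shrinks the retained set, which mirrors in spirit how a stricter subset such as the diamond set ends up smaller than the main set.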

We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof"). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions, for example, when developing new scientific knowledge, we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.


	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2023 GPQAAGraduateLevelGoogleProofQA	Julian Michael, Samuel R. Bowman, Richard Yuanzhe Pang, David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Julien Dirani			GPQA: A Graduate-Level Google-Proof Q&A Benchmark				10.48550/arXiv.2311.12022		2023