2023 GPT-4 Passes the Bar Exam


Subject Headings: U.S. Bar Exam, U.S. Uniform Bar Exam (UBE).

Notes

Cited By

2023

Quotes

Abstract

In this paper, we experimentally evaluate the zero-shot performance of a preliminary version of GPT-4 against prior generations of GPT on the entire Uniform Bar Examination (UBE), including not only the multiple-choice Multistate Bar Examination (MBE), but also the open-ended Multistate Essay Exam (MEE) and Multistate Performance Test (MPT) components. On the MBE, GPT-4 significantly outperforms both human test-takers and prior models, demonstrating a 26% increase over ChatGPT and beating humans in five of seven subject areas. On the MEE and MPT, which have not previously been evaluated by scholars, GPT-4 scores an average of 4.2/6.0 as compared to much lower scores for ChatGPT. Graded across the UBE components, in the manner in which a human test-taker would be, GPT-4 scores approximately 297 points, significantly in excess of the passing threshold for all UBE jurisdictions. These findings document not just the rapid and remarkable advance of large language model performance generally, but also the potential for such models to support the delivery of legal services in society.

Introduction

It is difficult to imagine a professional field for which natural language is more integral than the law. As part of their daily activities, legal professionals like judges, regulators, legislators, and lawyers spend countless hours consuming and/or producing a wide variety of legal documents. The document types are varied but include legal texts such as statutes, regulations, judicial decisions, contracts, patents, briefs, opinion letters, memos, and other related materials.1,2

Legal language is notoriously complex,3,4,5 and the ability to interpret such complex documents often requires years of study. Indeed, part of the charge of legal education is a linguistic immersion program in which students are trained to parse both the syntactic and semantic nuances of various legal texts.6 There are many sources of complexity in legal language: for example, “words like ‘security’ that have common meaning in normal language often have different, context-specific meanings in legal language. Many words that do not occur at all in normal language, like ‘estoppel’ or ‘indemnitor,’ occur regularly in legal corpora.”7 This semantic depth and breadth is challenging for those not otherwise familiar with the legal lexicon. The public, for example, is quite aware of the linguistic gap between general language and legal language, referred to by many as “legalese.”

The complexity of the law8,9,10,11 imposes real consequences for many individuals and organizations.12,13 In part due to complexity, legal systems have struggled to assist with the quantity, quality, and accessibility of legal services demanded by society.13,14,15 A technology-based force multiplier15,16 is arguably needed to help support the high cost and unmet demand for legal services.17,18 Yet, in order for technology systems to meet this need, they must confront the nuances of legal languages and the difficulties of complex legal reasoning tasks.19 Unfortunately, from a historical perspective, computational technologies have struggled not only with natural language processing (NLP) tasks generally, but, in particular, with complex or domain-specific tasks like those in law.

There is promise on the horizon, however; state-of-the-art performance in NLP has advanced substantially over the last decade, largely driven by advances in computer hardware, data availability, and neural techniques. Indeed, cutting-edge work within the field of NLP has recently undergone a rapid transition in which classical NLP methods have been supplanted by neural-based methods.20,21 While neural techniques have a long history,22,23,24,25 current modeling approaches generally trace their lineage to the arc from shallow embeddings trained on CPUs to the current transformer-based architectures optimized for purpose-built, distributed GPU/TPU infrastructure.26–37

While there is an increasing number of generally-accessible Large Language Models (LLMs), the most well-known of these are from OpenAI’s family of Generative Pre-trained Transformer models, commonly referred to as GPT.34,38,39,40,41 In November 2022, OpenAI released a chat interface to a version of its “GPT-3.5” models, colloquially known as ChatGPT, which reportedly resulted in over 1M user sign-ups within six days of release and over 100M users in the first ninety days. As described by OpenAI, GPT-4 is “a transformer-style model pre-trained to predict the next token in a document, using both publicly available data (such as internet data) and data licensed from third-party providers. The model was then fine-tuned using Reinforcement Learning from Human Feedback (RLHF).”38 While this family of models encompasses a range of tasks, sizes, and training techniques and continues to expand, its models generally feature billions of parameters, are pre-trained on billions of tokens, and are refined using supervised fine-tuning and/or reinforcement learning.

NLP models have progressed in the legal domain19,42,43 with increasing application of neural techniques on specific legal tasks.44,45,46 Several recent papers have demonstrated meaningful zero-shot progress on a variety of applied tasks,2,47,48,49,50,51 suggesting further potential for application as state-of-the-art improves.

Recognizing the advancing capabilities of large language models, we sought an exemplary challenge to demonstrate this potential in the legal domain. Thus, in recent prior work,7 a subset of the authors demonstrated the near-passing zero-shot performance of TEXT-DAVINCI-003 on the multiple-choice component (MBE) of the Uniform Bar Exam – a task that requires both comprehensive domain knowledge and advanced semantic and syntactic command of the English language. While no prompts or parameters met a “passing” level, the rate of performance increase from TEXT-DAVINCI-001 to TEXT-DAVINCI-003 strongly suggested that passing performance could “occur within the next 0-18 months.”7 In this paper, we demonstrate that this time has come, not only for the multiple-choice component (MBE) but also for the essay (MEE) and performance test (MPT) components of the UBE. As demonstrated by the zero-shot performance results we report herein, GPT-4 can “pass the Bar” in all UBE jurisdictions.
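
To make the zero-shot setup concrete, the following is a minimal sketch (in Python) of how an MBE-style multiple-choice item can be posed to a language model and the chosen letter recovered from its free-form reply. The prompt wording, the `complete` callable standing in for an actual model API, and the answer-extraction rule are illustrative assumptions, not the authors' exact protocol.

```python
import re
from typing import Callable

def ask_mbe_question(complete: Callable[[str], str],
                     stem: str, choices: dict[str, str]) -> str:
    """Pose one multiple-choice item zero-shot (no worked examples in
    the prompt) and return the letter A-D extracted from the reply."""
    lettered = "\n".join(f"({k}) {v}" for k, v in sorted(choices.items()))
    prompt = ("Answer the following bar exam question. "
              "Reply with the single best answer choice.\n\n"
              f"{stem}\n\n{lettered}\n\nAnswer:")
    reply = complete(prompt)  # hypothetical wrapper around an LLM API call
    match = re.search(r"\b([A-D])\b", reply)
    return match.group(1) if match else ""

# Stubbed model for demonstration; a real run would call an LLM here.
fake_model = lambda prompt: "(B) is the best answer."
letter = ask_mbe_question(
    fake_model,
    "O conveys Blackacre 'to A for life, then to B.' What interest does B hold?",
    {"A": "A contingent remainder",
     "B": "An indefeasibly vested remainder",
     "C": "A springing executory interest",
     "D": "A reversion"})
print(letter)  # -> "B"
```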

The Uniform Bar Exam

Description of the Uniform Bar Exam

The vast majority of jurisdictions in the United States require the completion of a professional licensure exam (“The Bar Exam”) as a precondition to practice law. The Bar Exam is a notoriously-challenging battery of tests designed to evaluate an applicant’s legal knowledge and skills. Successfully passing the Exam requires that an examinee can discern challenging factual and legal scenarios, understand and apply legal principles, and both consume and produce complex legal language.

In order to sit for the exam, the typical applicant must complete at least seven years of post-secondary education, including completion of a four-year undergraduate degree, followed by matriculation and graduation from a law school accredited by the American Bar Association. In addition to these years of education, most applicants also invest substantial amounts of time and money into specialized test-taking courses.52 Despite this effort and investment, roughly one in five test-takers is unable to pass the Exam on their first attempt.

Attorney licensure is a topic governed by the states, typically through rules promulgated at the direction of state supreme courts.53 Thus, each state is responsible for selecting its own requirements and methods of exam administration. Notwithstanding such broad authority, many states have elected to standardize their requirements. Over the past decade, more jurisdictions have chosen to participate in the Uniform Bar Exam (UBE).52,54 Despite this push toward greater uniformity, however, there are often additional requirements, even within states that have adopted the UBE, such as the Multistate Professional Responsibility Examination (MPRE) or state-specific subject matter areas. In this paper, we address only the UBE as produced by the National Conference of Bar Examiners (NCBE). The core UBE components, outlined in Table 1 below, are the Multistate Bar Exam (MBE), the Multistate Essay Exam (MEE), and the Multistate Performance Test (MPT).

As shown in Table 1 and discussed in detail in Appendix I (pp. 15 et seq.), the UBE is a 12-hour exam taken over two days, with the MPT and MEE administered on Day One while the MBE is administered on Day Two. The Uniform Bar Exam is scored on a 400-point scale, with the scores from all three components combined into a single total. In general, there is no minimum required score for any specific component of the exam, as a strong score on one component can help an examinee overcome a weaker score on another. As displayed in Table 8, a combined score of 266 points is enough to pass in jurisdictions such as Illinois, New York, and the District of Columbia, while a score of 270 points would pass in the vast majority of states that use the UBE.

UBE Component                     | Total UBE Points | Questions                       | Time    | Time Per Question
Multistate Bar Exam (MBE)         | 200 Points       | 200 Questions (Multiple Choice) | 6 Hours | 1 min 48 sec
Multistate Essay Exam (MEE)       | 120 Points       | 6 Questions (3-4 Subquestions)  | 3 Hours | 30 minutes
Multistate Performance Test (MPT) | 80 Points        | 2 Questions (3-4 Subquestions)  | 3 Hours | 90 minutes

Table 1. Summary of Uniform Bar Exam (UBE) Components
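
As a rough illustration of the arithmetic implied by Table 1 and the cut scores discussed above, the sketch below combines component results on the 400-point scale and checks them against the 266- and 270-point thresholds. The linear mapping from fraction-correct to points is a deliberate simplification (actual UBE scores are scaled and equated by the NCBE), so this is illustrative arithmetic only.

```python
# Point allocations from Table 1: MBE 200, MEE 120, MPT 80 (total 400).
WEIGHTS = {"MBE": 200, "MEE": 120, "MPT": 80}

def combined_ube_score(fraction_earned: dict[str, float]) -> float:
    """Naively sum the fraction of available points earned per component.
    Real UBE scoring is scaled/equated by the NCBE; this is a toy model."""
    return sum(WEIGHTS[c] * fraction_earned[c] for c in WEIGHTS)

score = combined_ube_score({"MBE": 0.76, "MEE": 0.70, "MPT": 0.70})
print(f"combined score: {score:.0f} / 400")   # 292 on this toy mapping
print("passes at 266 (e.g., IL, NY, DC):", score >= 266)
print("passes at 270 (most UBE states): ", score >= 270)
```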

Data & Methods

Data

The primary focus of the NCBE is on the construction of exams for use on a nationwide basis. The NCBE exams are developed in an institutional context by the organization’s staff and advisors, who have many years of experience designing, scoring, and calibrating these exams across U.S. jurisdictions.

As noted earlier, the UBE has three separate components: the MBE, the MEE, and the MPT. In order to analyze whether GPT-4 could pass the Bar Exam, we collected relevant materials for each of the three components. For the MEE and the MPT, we collected the most recently released questions, from the July 2022 Bar Examination. These questions are readily available through the websites of many state bars. The July 2022 MEE features six questions, covering Evidence, Contracts, Corporations, Trusts, Civil Procedure, and Real Property. The two questions for the July 2022 MPT required test-takers to (i) draft a memo in the context of a domestic relations matter with a series of choice-of-law issues and (ii) construct an objective memo focused on questions of criminal law and legal ethics.

The MBE questions used in this study are official multistate bar examination questions from previous administrations of the UBE.55 The MBE full-length exam we use is subject-weighted in near-equal proportion across the seven core subject matter areas. While the exact sequence of questions administered is not identical to any actual exam as administered, it has been described by the NCBE itself as “the first [MBE Complete Practice Exam] from NCBE to mimic a full-length MBE.”55 While we are not able to release the MBE questions, the questions can be purchased directly from an NCBE authorized reseller.
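
To see what a subject weighting “in near-equal proportion” across the seven MBE subject areas implies for a 200-question exam, consider the short sketch below; the subject list is the standard set of MBE areas, and the even-split rule is our reading of the description rather than an NCBE specification.

```python
SUBJECTS = ["Civil Procedure", "Constitutional Law", "Contracts",
            "Criminal Law & Procedure", "Evidence", "Real Property", "Torts"]

def near_equal_split(total: int, buckets: list[str]) -> dict[str, int]:
    """Spread `total` questions across buckets as evenly as possible."""
    base, extra = divmod(total, len(buckets))
    return {s: base + (1 if i < extra else 0) for i, s in enumerate(buckets)}

print(near_equal_split(200, SUBJECTS))
# 200 / 7 -> four subjects receive 29 questions and three receive 28
```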

Links to access both the full-length MEE and MPT questions, as well as their representative “good” answers, are available in the online repository. These representative good answers are made available by state bar associations and reflect actual MEE and MPT answers produced by real examinees. These answers are described as “neither ‘average’ passing answers nor are they necessarily perfect answers.” We would suggest that the interested reader review these representative “good” answers side-by-side with our model outputs.

Methods

References

Katz, Daniel Martin; Bommarito, Michael James; Gao, Shang; Arredondo, Pablo (2023). “GPT-4 Passes the Bar Exam.” doi:10.2139/ssrn.4389233.