2024 HallucinationFreeAssessingtheRe
- (Magesh et al., 2024) ⇒ Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D. Manning, and Daniel E. Ho. (2024). “Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools.” In: Stanford preprint.
Subject Headings: AI-Driven Legal Research Tool, Legal Research Task.
Notes
- Evaluation Scope and Methodology:
- The study evaluates AI-driven legal research tools: LexisNexis's Lexis+ AI and Thomson Reuters's Ask Practical Law AI and Westlaw AI-Assisted Research (AI-AR), with GPT-4 as a general-purpose baseline, focusing on their tendency to produce false information, or "hallucinations."
- The study utilized a diverse set of 202 preregistered legal queries categorized into general legal research, jurisdiction/time-specific questions, false premise questions, and factual recall questions.
- Responses were manually coded for correctness and groundedness, ensuring inter-rater reliability, with queries run on Lexis+ AI, Ask Practical Law AI, and GPT-4 between March 22 and April 22, 2024, and on Westlaw's AI-AR between May 23-27, 2024.
- Findings on Accuracy and Hallucination Rates:
- Lexis+ AI showed the highest accuracy, with 65% correct responses; Westlaw AI-AR had a 42% accuracy rate, and Ask Practical Law AI had 20%. GPT-4, the general-purpose baseline, had a 49% accuracy rate.
- Lexis+ AI and Ask Practical Law AI hallucinated 17% of the time, while Westlaw AI-AR had a 33% hallucination rate, including incorrect legal information and misgrounded citations.
- When the evaluation was updated to include Westlaw AI-AR, the authors re-validated every hallucination coding; the findings were nearly identical, with minor changes in accuracy rates falling within the margin of inter-rater reliability.
- Responsiveness and Error Typology:
- Lexis+ AI, Westlaw AI-AR, and Ask Practical Law AI provided incomplete answers 18%, 25%, and 62% of the time, respectively. The length and detail of responses varied, with Westlaw providing the longest answers on average.
- The study identifies a typology of errors made by legal RAG systems, such as misunderstanding case holdings, failing to distinguish between legal actors, misapplying the hierarchy of legal authority, and fabricating legal provisions.
- Implications for AI Development and Evaluation Metrics:
- The study highlights the importance of legal AI companies providing clear, evidence-based information about their products' capabilities and limitations, emphasizing the need for transparency and responsible development practices.
- The study proposes evaluating AI-generated responses based on correctness and groundedness, offering valuable insights for establishing comprehensive evaluation metrics for contract review AI platforms (see the coding sketch after these notes).
- Recommendations for Legal Professionals and Continuous Assessment:
- Legal professionals must supervise and verify AI outputs to mitigate risks, as improvements in RAG systems are necessary but not sufficient to eliminate hallucinations.
- The study underscores the necessity of ongoing, rigorous benchmarking and public evaluation of AI tools in law, which is crucial for ensuring the reliability and trustworthiness of contract review AI systems as they evolve and improve.
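The correctness-and-groundedness coding described in these notes translates naturally into a small aggregation routine. The Python sketch below shows one plausible way such manual codes could be tallied into per-tool accuracy, hallucination, and incomplete-answer rates; the label names, the `is_hallucination` rule, and the `rates_by_tool` helper are illustrative assumptions for this page, not the paper's published coding protocol.

```python
from dataclasses import dataclass
from enum import Enum


class Correctness(Enum):
    CORRECT = "correct"        # the legal proposition is accurate and responsive
    INCORRECT = "incorrect"    # the response asserts something false
    INCOMPLETE = "incomplete"  # refusal or only a partial answer


class Groundedness(Enum):
    GROUNDED = "grounded"        # cited sources actually support the claims
    MISGROUNDED = "misgrounded"  # cites real sources that do not support the claims
    UNGROUNDED = "ungrounded"    # makes claims without any valid citation


@dataclass
class CodedResponse:
    tool: str
    correctness: Correctness
    groundedness: Groundedness


def is_hallucination(r: CodedResponse) -> bool:
    """Illustrative rule (not the paper's exact definition): count a response as a
    hallucination if it states incorrect legal information or rests its claims on
    a misgrounded citation."""
    return (
        r.correctness is Correctness.INCORRECT
        or r.groundedness is Groundedness.MISGROUNDED
    )


def rates_by_tool(responses: list[CodedResponse]) -> dict[str, dict[str, float]]:
    """Aggregate manually coded responses into per-tool accuracy, hallucination,
    and incomplete-answer rates."""
    out: dict[str, dict[str, float]] = {}
    for tool in sorted({r.tool for r in responses}):
        rs = [r for r in responses if r.tool == tool]
        n = len(rs)
        out[tool] = {
            "accuracy": sum(r.correctness is Correctness.CORRECT for r in rs) / n,
            "hallucination_rate": sum(is_hallucination(r) for r in rs) / n,
            "incomplete_rate": sum(r.correctness is Correctness.INCOMPLETE for r in rs) / n,
        }
    return out


# Minimal usage example with a hypothetical tool name.
if __name__ == "__main__":
    coded = [
        CodedResponse("ToolA", Correctness.CORRECT, Groundedness.GROUNDED),
        CodedResponse("ToolA", Correctness.INCORRECT, Groundedness.MISGROUNDED),
    ]
    print(rates_by_tool(coded))
```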
Cited By
Quotes
Abstract
Legal practice has witnessed a sharp rise in products incorporating artificial intelligence (AI). Such tools are designed to assist with a wide range of core legal tasks, from search and summarization of caselaw to document drafting. But the large language models used in these tools are prone to "hallucinate," or make up false information, making their use risky in high-stakes domains. Recently, certain legal research providers have touted methods such as retrieval-augmented generation (RAG) as "eliminating" (Casetext, 2023) or "avoid[ing]" hallucinations (Thomson Reuters, 2023), or guaranteeing "hallucination-free" legal citations (LexisNexis, 2023). Because of the closed nature of these systems, systematically assessing these claims is challenging. In this article, we design and report on the first preregistered empirical evaluation of AI-driven legal research tools. We demonstrate that the providers' claims are overstated. While hallucinations are reduced relative to general-purpose chatbots (GPT-4), we find that the AI research tools made by LexisNexis (Lexis+ AI) and Thomson Reuters (Westlaw AI-Assisted Research and Ask Practical Law AI) each hallucinate between 17% and 33% of the time. We also document substantial differences between systems in responsiveness and accuracy. Our article makes four key contributions. It is the first to assess and report the performance of RAG-based proprietary legal AI tools. Second, it introduces a comprehensive, preregistered dataset for identifying and understanding vulnerabilities in these systems. Third, it proposes a clear typology for differentiating between hallucinations and accurate legal responses. Last, it provides evidence to inform the responsibilities of legal professionals in supervising and verifying AI outputs, which remains a central open question for the responsible integration of AI into law.
References
| | Author | title | year |
|---|---|---|---|
| 2024 HallucinationFreeAssessingtheRe | Christopher D. Manning, Daniel E. Ho, Mirac Suzgun, Varun Magesh, Faiz Surani, Matthew Dahl | Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools | 2024 |