2024 TheRAGReportLargeLanguageModels
- (Raini et al., 2024) ⇒ Ron Raini, Mike Kennedy, Elliot White, and Kerry Westland. (2024). “The RAG Report: Large Language Models in Legal Due Diligence.” Addleshaw & Goddard (AG) whitepaper.
Subject Headings: Legal-Domain RAG, Automated Legal Due Diligence, Contract Clause Extraction.
Notes
- The publication presents what its authors describe as the first systematic research paper by a law firm demonstrating how RAG can enhance LLMs for legal due diligence, achieving production-grade accuracy without model fine-tuning.
- The publication established a benchmark performance improvement in contract analysis, raising accuracy from 74% to 95% through systematic optimization of retrieval and generation components.
- The publication validated optimal chunking parameters for legal documents, determining that 3,500-character chunks with 700-character overlaps maximize retrieval effectiveness for contract clauses (a chunking sketch follows this list).
- The publication describes a hybrid retrieval approach combining vector embeddings with advanced keyword search, which improved retrieval accuracy by 14-22% for complex legal provisions (a hybrid-scoring sketch follows this list).
- The publication found that accusatory prompting in follow-up questions significantly improves LLM performance, introducing it as a technique for response validation (an illustrative prompt pair follows this list).
- The publication demonstrated GPT-4-32K's superior performance with an F1 score improvement of 4-7% over GPT-4-Turbo in first prompt scenarios and 1-2% in follow-up prompt scenarios for legal extraction tasks.
- The publication quantified the performance gap between RAG-optimized LLMs (95%), traditional ML tools (86%), and generic GenAI tools (72%) for legal document analysis.
- The publication empirically evaluated model selection between GPT-4 variants through systematic comparison of clause extraction performance across multiple provision types, establishing baseline metrics for legal extraction tasks.
- The publication developed a reproducible methodology for testing clause extraction across multiple provision types using the CUAD dataset, enabling standardized evaluation of legal AI systems.
- The publication established best practices for prompt engineering in legal contexts, emphasizing concise instructions and targeted prompts to prevent context window saturation.
- The publication demonstrated the feasibility of in-house legal AI development, providing a framework for law firms to build customized due diligence solutions.
- The publication outlined a research agenda for advancing legal AI, including exploration of alternative LLMs, advanced chunking techniques, and automated risk assessment capabilities.
- The publication demonstrated that follow-up prompting provides consistent performance improvements across all configurations, with an average 9.2% accuracy increase across all tested scenarios.
- The publication identified that certain provisions (like Governing Law and Effective Date) are inherently easier to extract with >97% accuracy, while others (like Exclusivity and Cap on Liability) are more challenging at ~67-76% accuracy due to greater variability in formulation.
- The publication established that emotive/urgent language in prompts can enhance LLM performance, though this effect requires further systematic study.
- The publication validated that retrieving top 10 chunks provides optimal balance between retrieval completeness and context window utilization.
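A minimal Python sketch of the chunking step described above (Chunking Strategy 2: 3,500-character chunks with a 700-character overlap). The character-based splitting and the function name chunk_document are illustrative assumptions; the report does not publish its implementation.

```python
def chunk_document(text: str, chunk_size: int = 3500, overlap: int = 700) -> list[str]:
    """Split a contract's text into overlapping character-based chunks.

    Mirrors Chunking Strategy 2 from the report: 3,500-character chunks with a
    700-character overlap so that clause structure and context survive chunk
    boundaries. The exact overlap semantics ("on each side") are an assumption.
    """
    chunks = []
    step = chunk_size - overlap  # 2,800 characters of new text per chunk
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```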
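A sketch of one way to blend vector similarity with keyword relevance into a single ranking and keep the top 10 chunks, as the hybrid retrieval and top-10 notes above describe. The score-blending scheme, the alpha weight, and the simple term-count keyword scorer are assumptions for illustration; the report itself pairs Vector Search Queries with an Advanced Keywords Query rather than this exact fusion.

```python
import math
from collections import Counter


def keyword_score(query_terms: list[str], chunk: str) -> float:
    """Toy keyword relevance: count occurrences of query terms in the chunk.

    Stand-in for the report's richer 'Advanced Keywords Query'."""
    tokens = Counter(chunk.lower().split())
    return float(sum(tokens[t.lower()] for t in query_terms))


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_retrieve(query_embedding, query_terms, chunks, chunk_embeddings,
                    alpha: float = 0.5, top_k: int = 10):
    """Rank chunks by a weighted blend of vector similarity and keyword relevance,
    returning the top_k results (the report settled on retrieving 10 chunks)."""
    keyword_raw = [keyword_score(query_terms, c) for c in chunks]
    max_kw = max(keyword_raw, default=0.0) or 1.0  # normalise keyword scores to [0, 1]
    scored = []
    for chunk, emb, kw in zip(chunks, chunk_embeddings, keyword_raw):
        score = alpha * cosine(query_embedding, emb) + (1 - alpha) * (kw / max_kw)
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```

Production hybrid search backends typically perform this fusion server-side (often via rank-based methods rather than score blending); the point here is only the shape of the combination.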
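An illustrative prompt pair for the follow-up ("accusatory") technique noted above, which also shows the split between a lean system prompt and a provision-specific prompt. The wording is hypothetical; the report does not publish its exact prompts.

```python
# Illustrative only: the exact prompt wording used in the report is not public.

SYSTEM_PROMPT = (
    "You are a legal analyst reviewing commercial agreements. "
    "Answer only from the contract excerpts provided."
)


def first_prompt(provision: str, chunks: list[str]) -> str:
    """Provision-specific first prompt over the retrieved chunks."""
    excerpts = "\n\n".join(chunks)
    return (
        f"Extract every clause relating to '{provision}' from the excerpts below. "
        f"Quote the clause text verbatim.\n\nExcerpts:\n{excerpts}"
    )


def follow_up_prompt(provision: str) -> str:
    """The 'accusatory' follow-up: assert that something was missed so the model
    re-checks the retrieved chunks still present in the conversation."""
    return (
        f"You have missed relevant '{provision}' clauses in the excerpts you were given. "
        f"Re-read the excerpts carefully and return any clauses you omitted, "
        f"or confirm explicitly that nothing was missed."
    )
```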
Cited By
Quotes
- Supporting the applicability across various legal tasks:
"The main focus is on how to increase the accuracy of LLMs by optimising the retrieval, extraction, and identification of relevant clauses in commercial agreements. The paper sets out the approach and methodology we employed in the testing and evaluation of LLM-powered systems for this specific use case. We discuss the insights and lessons learned with respect to our use case, as well as its applicability for a wider set of legal applications" . - Supporting the focus on risk assessment and compliance within legal tasks:
"There is a process for due diligence following the extraction of the Provisions where focused questions need to be applied to identify any corresponding risk. In this paper, we have focused on the retrieval and extraction aspects of the process, with risk identification currently ongoing. Each of these aspects will have dedicated prompts and parameters to achieve more accurate responses and results from the LLMs" . - Supporting large-scale diligence applications across diverse legal domains:
"To develop a platform that can help to deliver large-scale diligence exercises across a variety of business areas and that is able to do the following: Quickly classify, review, and analyse all documents within a deal data room; Apply pre-defined due diligence questions that are relevant to each document type; Generate a comprehensive draft of the due diligence (DD) report, providing detailed answers; Produce a concise risk report, highlighting key issues and potential risks; and Achieve this in a fraction of the time and manpower currently required to generate such reports" . - Supporting the extraction of commonly relevant commercial provisions:
"The specific clauses we were looking to extract and measure included a variety of common commercial provisions, such as Assignment, Audit Rights, Cap on Liability, Change of Control, Effective Date, Exclusivity, and Governing Law" . - Supporting "first systematic research paper" & "production-grade accuracy":
"This paper, which we believe is the first of its kind from a law firm, is a deep dive into the effectiveness of LLMs and their application to legal-specific tasks...Through our optimised approach, we can increase the accuracy of LLMs in commercial contract reviews on average, from 74% to 95%" . - Supporting optimal chunking parameters:
"Using Chunking Strategy 2 (breaking documents down into chunks of 3,500 characters, with an overlap of 700 characters on each side)...This enabled us to break documents into sensibly sized text excerpts, with the overlap allowing us to maintain clause structure and context" . - Supporting hybrid retrieval approach:
"We implemented a hybrid Retrieval Method that combined both Vector Search Queries and Advanced Keywords Queries. We preferred the Advanced Keywords Query over a simple keyword search, as it was more powerful and provided more flexibility and customization" . - Supporting accusatory prompting discovery:
"The Follow-Up Prompt variation that improved performance the most was when we directly accused the LLM of missing relevant information. This seemed to stimulate the LLM to validate its first response, reviewing the retrieved chunks fed into it with greater care" . - Supporting GPT-4-32K superiority:
"These results demonstrate the superiority of GPT-4-32K over GPT-4-Turbo in our Provision extraction task, with stronger performance on average as well as in most of the test scenarios" . - Supporting performance gap quantification:
"The Machine Learning Extraction Tool lagged behind most of the LLM-based configurations we tested, with an average F1 score of 0.86...Compared to all other tools and configurations, the GenAI Contract Review Tool was the worst performer, with an average F1 score of 0.72" . - Supporting CUAD dataset methodology:
"We selected these Provisions as they were the closest data points to the CUAD dataset...This enabled us to run tests on non-confidential documents and freely share the results, in order to evaluate performance and make all results verifiable" . - Supporting prompt engineering best practices:
"Following these findings, we decided to have a relatively simple System Prompt and only include the most high-level persona and task description instructions, moving some of the more specific instructions and directions into a Provision-Specific Prompt" . - Supporting in-house legal AI development feasibility:
"We found that a solution utilizing RAG can be optimized effectively through a mixture of the selected components, and that this could be done within our own infrastructure at AG by utilizing our own team" . - Supporting systematic model comparison and metrics:
"Furthermore, when comparing pairs of configurations across each Provision and for both First Prompt and Follow Up Prompt scenarios, we can see that out of 18 pairs, GPT4-32K achieved a higher F1 score in 11 cases, the same F1 score in four cases, and a lower F1 score in only three cases...Appendix 2 has more details of paired comparisons, such as 'RAG 10 – GPT4-Turbo – First Prompt (Assignment)' versus 'RAG 10 – GPT4-32K – First Prompt (Assignment)', 'RAG 10 – GPT4-Turbo – Follow Up (Assignment)' versus 'RAG 10 – GPT4-32K – Follow Up (Assignment)', and so on." . - Supporting evaluation methodology and metrics:
"To evaluate all the configurations and third-party tools tested using the criteria described above... When we take our next steps in this research, looking at Risk Identification, we will move towards more complex and subjective evaluation metrics and dedicated risks datasets that we will develop." .
Figure 14. Provisions Used in Testing

| Provisions Used in Testing |
|---|
| Assignment |
| Audit Rights |
| Cap on Liability |
| Change of Control |
| Effective Date |
| Exclusivity |
| Governing Law |
| License Grant |
| Termination for Convenience |
Table 1. Chunking Strategy Results Comparison

| Provision | # of Samples | Strategy 1 Recall@1 | Strategy 1 Recall@7 | Strategy 2 Recall@1 | Strategy 2 Recall@5 | Strategy 3 Recall@1 | Strategy 3 Recall@4 |
|---|---|---|---|---|---|---|---|
| Assignment | 225 | 40.44% | 75.11% | 49.33% | 76.89% | 46.67% | 68.00% |
| Change of Control | 131 | 26.72% | 64.89% | 21.37% | 68.70% | 27.48% | 58.02% |
| Effective Date | 89 | 38.20% | 80.90% | 44.94% | 71.91% | 47.19% | 70.79% |
| Exclusivity | 151 | 9.93% | 36.42% | 13.25% | 33.11% | 10.60% | 35.76% |
| Non-Compete | 123 | 27.64% | 55.28% | 30.08% | 57.72% | 21.95% | 41.46% |
| ROFR/ROFO/ROFN | 206 | 21.84% | 67.48% | 44.66% | 68.93% | 30.58% | 64.56% |
| Termination For Convenience | 80 | 28.75% | 76.25% | 50.00% | 81.25% | 46.25% | 76.25% |
| Weighted Average | | 27.56% | 64.58% | 36.62% | 65.17% | 32.44% | 58.81% |
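The chunking strategies in Table 1 (and the retrieval scenarios in Tables 2-4) are compared by Recall@k. A short sketch under the assumption that Recall@k is computed as a per-sample hit rate, i.e. the share of test samples whose top-k retrieved chunks contain at least one relevant chunk; the report does not spell out the exact formula:

```python
def recall_at_k(retrieved_per_sample: list[list[str]],
                relevant_per_sample: list[set[str]],
                k: int) -> float:
    """Share of samples for which at least one relevant chunk id appears
    among the top-k retrieved chunk ids (an assumed reading of Recall@k)."""
    hits = 0
    total = 0
    for retrieved, relevant in zip(retrieved_per_sample, relevant_per_sample):
        total += 1
        if any(chunk_id in relevant for chunk_id in retrieved[:k]):
            hits += 1
    return hits / total if total else 0.0
```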
Table 2. Retrieval Scenario 1 Results: Recall@10 | All CUAD Contracts | Optimised vs Non-optimised Retrieval

| Provision | # of Samples | Non-optimised Retrieval | Optimised Retrieval | Difference |
|---|---|---|---|---|
| Assignment | 654 | 89.14% | 98.62% | 9.48% |
| Audit Rights | 643 | 83.05% | 98.76% | 15.71% |
| Cap on Liability | 672 | 88.69% | 97.47% | 8.78% |
| Change of Control | 254 | 87.80% | 96.06% | 8.26% |
| Effective Date | 448 | 94.42% | 99.55% | 5.13% |
| Exclusivity | 410 | 51.95% | 95.12% | 43.17% |
| Governing Law | 464 | 99.14% | 99.78% | 0.64% |
| Insurance | 561 | 92.87% | 99.11% | 6.24% |
| Licence Grant | 777 | 63.32% | 97.30% | 33.98% |
| Minimum Commitment | 424 | 65.80% | 94.10% | 28.30% |
| Non-Compete | 260 | 71.92% | 93.08% | 21.16% |
| ROFR/ROFO/ROFN | 367 | 74.93% | 91.28% | 16.35% |
| Termination For Convenience | 246 | 94.31% | 99.19% | 4.88% |
| Warranty | 177 | 85.31% | 98.31% | 13.00% |
| Weighted Average | | 81.31% | 97.28% | 15.97% |
Table 3. Retrieval Scenario 2 Results: Recall@10 | CUAD Contracts +20 Chunks | Optimised vs Non-optimised Retrieval

| Provision | # of Samples | Non-optimised Retrieval | Optimised Retrieval | Difference |
|---|---|---|---|---|
| Assignment | 338 | 88.76% | 97.34% | 8.58% |
| Audit Rights | 472 | 80.08% | 98.31% | 18.23% |
| Cap on Liability | 375 | 80.53% | 95.47% | 14.94% |
| Change of Control | 173 | 82.08% | 94.22% | 12.14% |
| Effective Date | 157 | 90.45% | 98.73% | 8.28% |
| Exclusivity | 215 | 26.98% | 90.70% | 63.72% |
| Governing Law | 179 | 97.77% | 99.44% | 1.67% |
| Insurance | 391 | 90.54% | 98.72% | 8.18% |
| Licence Grant | 493 | 48.88% | 95.74% | 46.86% |
| Minimum Commitment | 271 | 60.89% | 90.77% | 29.88% |
| Non-Compete | 161 | 59.63% | 89.44% | 29.81% |
| ROFR/ROFO/ROFN | 265 | 66.79% | 87.92% | 21.13% |
| Termination For Convenience | 115 | 88.70% | 98.26% | 9.56% |
| Warranty | 101 | 78.22% | 97.03% | 18.81% |
| Weighted Average | | 73.15% | 95.36% | 22.21% |
Table 4. Retrieval Scenario 3 Results: Recall@20 | CUAD Contracts +40 Chunks | Optimised vs Non-optimised Retrieval

| Provision | # of Samples | Non-optimised Retrieval | Optimised Retrieval | Difference |
|---|---|---|---|---|
| Assignment | 175 | 93.71% | 99.43% | 5.72% |
| Audit Rights | 312 | 86.54% | 99.68% | 13.14% |
| Cap on Liability | 165 | 86.06% | 95.15% | 9.09% |
| Change of Control | 95 | 89.47% | 98.95% | 9.48% |
| Effective Date | 62 | 91.94% | 100.00% | 8.06% |
| Exclusivity | 117 | 53.85% | 97.44% | 43.59% |
| Governing Law | 78 | 98.72% | 98.72% | 0.00% |
| Insurance | 253 | 91.70% | 99.21% | 7.51% |
| Licence Grant | 271 | 68.27% | 98.15% | 29.88% |
| Minimum Commitment | 156 | 78.21% | 96.15% | 17.94% |
| Non-Compete | 100 | 79.00% | 96.00% | 17.00% |
| ROFR/ROFO/ROFN | 166 | 83.13% | 95.18% | 12.05% |
| Termination For Convenience | 57 | 87.72% | 100.00% | 12.28% |
| Warranty | 43 | 90.70% | 95.35% | 4.65% |
| Weighted Average | | 83.07% | 97.95% | 14.88% |
References
- Ron Raini, Mike Kennedy, Elliot White, and Kerry Westland. (2024). "The RAG Report: Large Language Models in Legal Due Diligence." Addleshaw & Goddard (AG) whitepaper.