2024 TheRAGReportLargeLanguageModels
- (Raini et al., 2024) ⇒ Ron Raini, Mike Kennedy, Elliot White, and Kerry Westland. (2024). “The RAG Report: Large Language Models in Legal Due Diligence.” Addleshaw & Goddard (AG) whitepaper.
Subject Headings: Legal-Domain RAG, Automated Legal Due Diligence, Contract Clause Extraction.
Notes
- The publication presents what its authors describe as the first systematic research paper by a law firm demonstrating how RAG can enhance LLMs for legal due diligence, achieving production-grade accuracy without model fine-tuning.
- The publication established a benchmark performance improvement in contract analysis, raising accuracy from 74% to 95% through systematic optimization of retrieval and generation components.
- The publication validated optimal chunking parameters for legal documents, determining that 3,500-character chunks with 700-character overlaps maximize retrieval effectiveness for contract clauses (a chunking sketch follows this list).
- The publication describes a hybrid retrieval approach combining vector embeddings with advanced keyword search, which improved retrieval accuracy by 14-22% for complex legal provisions (a hybrid-scoring sketch follows this list).
- The publication found that accusatory prompting in follow-up questions significantly improves LLM performance, introducing it as a technique for response validation (an illustrative prompt pair follows this list).
- The publication demonstrated GPT-4-32K's superior performance with an F1 score improvement of 4-7% over GPT-4-Turbo in first prompt scenarios and 1-2% in follow-up prompt scenarios for legal extraction tasks.
- The publication quantified the performance gap between RAG-optimized LLMs (95%), traditional ML tools (86%), and generic GenAI tools (72%) for legal document analysis.
- The publication empirically evaluated model selection between GPT-4 variants through systematic comparison of clause extraction performance across multiple provision types, establishing baseline metrics for legal extraction tasks.
- The publication developed a reproducible methodology for testing clause extraction across multiple provision types using the CUAD dataset, enabling standardized evaluation of legal AI systems.
- The publication established best practices for prompt engineering in legal contexts, emphasizing concise instructions and targeted prompts to prevent context window saturation.
- The publication demonstrated the feasibility of in-house legal AI development, providing a framework for law firms to build customized due diligence solutions.
- The publication outlined a research agenda for advancing legal AI, including exploration of alternative LLMs, advanced chunking techniques, and automated risk assessment capabilities.
- The publication demonstrated that follow-up prompting provides consistent performance improvements across all configurations, with an average 9.2% accuracy increase across all tested scenarios.
- The publication identified that certain provisions (like Governing Law and Effective Date) are inherently easier to extract with >97% accuracy, while others (like Exclusivity and Cap on Liability) are more challenging at ~67-76% accuracy due to greater variability in formulation.
- The publication established that emotive/urgent language in prompts can enhance LLM performance, though this effect requires further systematic study.
- The publication validated that retrieving top 10 chunks provides optimal balance between retrieval completeness and context window utilization.
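A minimal Python sketch of the chunking step described above (Chunking Strategy 2: 3,500-character chunks with a 700-character overlap). The character-based splitting and the function name chunk_document are illustrative assumptions; the report does not publish its implementation.

```python
def chunk_document(text: str, chunk_size: int = 3500, overlap: int = 700) -> list[str]:
    """Split a contract's text into overlapping character-based chunks.

    Mirrors Chunking Strategy 2 from the report: 3,500-character chunks with a
    700-character overlap so that clause structure and context survive chunk
    boundaries. The exact overlap semantics ("on each side") are an assumption.
    """
    chunks = []
    step = chunk_size - overlap  # 2,800 characters of new text per chunk
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```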
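A sketch of one way to blend vector similarity with keyword relevance into a single ranking and keep the top 10 chunks, as the hybrid retrieval and top-10 notes above describe. The score-blending scheme, the alpha weight, and the simple term-count keyword scorer are assumptions for illustration; the report itself pairs Vector Search Queries with an Advanced Keywords Query rather than this exact fusion.

```python
import math
from collections import Counter


def keyword_score(query_terms: list[str], chunk: str) -> float:
    """Toy keyword relevance: count occurrences of query terms in the chunk.

    Stand-in for the report's richer 'Advanced Keywords Query'."""
    tokens = Counter(chunk.lower().split())
    return float(sum(tokens[t.lower()] for t in query_terms))


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_retrieve(query_embedding, query_terms, chunks, chunk_embeddings,
                    alpha: float = 0.5, top_k: int = 10):
    """Rank chunks by a weighted blend of vector similarity and keyword relevance,
    returning the top_k results (the report settled on retrieving 10 chunks)."""
    keyword_raw = [keyword_score(query_terms, c) for c in chunks]
    max_kw = max(keyword_raw, default=0.0) or 1.0  # normalise keyword scores to [0, 1]
    scored = []
    for chunk, emb, kw in zip(chunks, chunk_embeddings, keyword_raw):
        score = alpha * cosine(query_embedding, emb) + (1 - alpha) * (kw / max_kw)
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```

Production hybrid search backends typically perform this fusion server-side (often via rank-based methods rather than score blending); the point here is only the shape of the combination.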
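An illustrative prompt pair for the follow-up ("accusatory") technique noted above, which also shows the split between a lean system prompt and a provision-specific prompt. The wording is hypothetical; the report does not publish its exact prompts.

```python
# Illustrative only: the exact prompt wording used in the report is not public.

SYSTEM_PROMPT = (
    "You are a legal analyst reviewing commercial agreements. "
    "Answer only from the contract excerpts provided."
)


def first_prompt(provision: str, chunks: list[str]) -> str:
    """Provision-specific first prompt over the retrieved chunks."""
    excerpts = "\n\n".join(chunks)
    return (
        f"Extract every clause relating to '{provision}' from the excerpts below. "
        f"Quote the clause text verbatim.\n\nExcerpts:\n{excerpts}"
    )


def follow_up_prompt(provision: str) -> str:
    """The 'accusatory' follow-up: assert that something was missed so the model
    re-checks the retrieved chunks still present in the conversation."""
    return (
        f"You have missed relevant '{provision}' clauses in the excerpts you were given. "
        f"Re-read the excerpts carefully and return any clauses you omitted, "
        f"or confirm explicitly that nothing was missed."
    )
```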
Cited By
Quotes
- Supporting the applicability across various legal tasks:
"The main focus is on how to increase the accuracy of LLMs by optimising the retrieval, extraction, and identification of relevant clauses in commercial agreements. The paper sets out the approach and methodology we employed in the testing and evaluation of LLM-powered systems for this specific use case. We discuss the insights and lessons learned with respect to our use case, as well as its applicability for a wider set of legal applications" . - Supporting the focus on risk assessment and compliance within legal tasks:
"There is a process for due diligence following the extraction of the Provisions where focused questions need to be applied to identify any corresponding risk. In this paper, we have focused on the retrieval and extraction aspects of the process, with risk identification currently ongoing. Each of these aspects will have dedicated prompts and parameters to achieve more accurate responses and results from the LLMs" . - Supporting large-scale diligence applications across diverse legal domains:
"To develop a platform that can help to deliver large-scale diligence exercises across a variety of business areas and that is able to do the following: Quickly classify, review, and analyse all documents within a deal data room; Apply pre-defined due diligence questions that are relevant to each document type; Generate a comprehensive draft of the due diligence (DD) report, providing detailed answers; Produce a concise risk report, highlighting key issues and potential risks; and Achieve this in a fraction of the time and manpower currently required to generate such reports" . - Supporting the extraction of commonly relevant commercial provisions:
"The specific clauses we were looking to extract and measure included a variety of common commercial provisions, such as Assignment, Audit Rights, Cap on Liability, Change of Control, Effective Date, Exclusivity, and Governing Law" . - Supporting "first systematic research paper" & "production-grade accuracy":
"This paper, which we believe is the first of its kind from a law firm, is a deep dive into the effectiveness of LLMs and their application to legal-specific tasks...Through our optimised approach, we can increase the accuracy of LLMs in commercial contract reviews on average, from 74% to 95%" . - Supporting optimal chunking parameters:
"Using Chunking Strategy 2 (breaking documents down into chunks of 3,500 characters, with an overlap of 700 characters on each side)...This enabled us to break documents into sensibly sized text excerpts, with the overlap allowing us to maintain clause structure and context" . - Supporting hybrid retrieval approach:
"We implemented a hybrid Retrieval Method that combined both Vector Search Queries and Advanced Keywords Queries. We preferred the Advanced Keywords Query over a simple keyword search, as it was more powerful and provided more flexibility and customization" . - Supporting accusatory prompting discovery:
"The Follow-Up Prompt variation that improved performance the most was when we directly accused the LLM of missing relevant information. This seemed to stimulate the LLM to validate its first response, reviewing the retrieved chunks fed into it with greater care" . - Supporting GPT-4-32K superiority:
"These results demonstrate the superiority of GPT-4-32K over GPT-4-Turbo in our Provision extraction task, with stronger performance on average as well as in most of the test scenarios" . - Supporting performance gap quantification:
"The Machine Learning Extraction Tool lagged behind most of the LLM-based configurations we tested, with an average F1 score of 0.86...Compared to all other tools and configurations, the GenAI Contract Review Tool was the worst performer, with an average F1 score of 0.72" . - Supporting CUAD dataset methodology:
"We selected these Provisions as they were the closest data points to the CUAD dataset...This enabled us to run tests on non-confidential documents and freely share the results, in order to evaluate performance and make all results verifiable" . - Supporting prompt engineering best practices:
"Following these findings, we decided to have a relatively simple System Prompt and only include the most high-level persona and task description instructions, moving some of the more specific instructions and directions into a Provision-Specific Prompt" . - Supporting in-house legal AI development feasibility:
"We found that a solution utilizing RAG can be optimized effectively through a mixture of the selected components, and that this could be done within our own infrastructure at AG by utilizing our own team" . - Supporting systematic model comparison and metrics:
"Furthermore, when comparing pairs of configurations across each Provision and for both First Prompt and Follow Up Prompt scenarios, we can see that out of 18 pairs, GPT4-32K achieved a higher F1 score in 11 cases, the same F1 score in four cases, and a lower F1 score in only three cases...Appendix 2 has more details of paired comparisons, such as 'RAG 10 – GPT4-Turbo – First Prompt (Assignment)' versus 'RAG 10 – GPT4-32K – First Prompt (Assignment)', 'RAG 10 – GPT4-Turbo – Follow Up (Assignment)' versus 'RAG 10 – GPT4-32K – Follow Up (Assignment)', and so on." . - Supporting evaluation methodology and metrics:
"To evaluate all the configurations and third-party tools tested using the criteria described above... When we take our next steps in this research, looking at Risk Identification, we will move towards more complex and subjective evaluation metrics and dedicated risks datasets that we will develop." .
Figure 14. Provisions Used in Testing

| Provisions Used in Testing |
|---|
| Assignment |
| Audit Rights |
| Cap on Liability |
| Change of Control |
| Effective Date |
| Exclusivity |
| Governing Law |
| License Grant |
| Termination for Convenience |
Table 1. Chunking Strategy Results Comparison

| Provision | # of Samples | Strategy 1 Recall@1 | Strategy 1 Recall@7 | Strategy 2 Recall@1 | Strategy 2 Recall@5 | Strategy 3 Recall@1 | Strategy 3 Recall@4 |
|---|---|---|---|---|---|---|---|
| Assignment | 225 | 40.44% | 75.11% | 49.33% | 76.89% | 46.67% | 68.00% |
| Change of Control | 131 | 26.72% | 64.89% | 21.37% | 68.70% | 27.48% | 58.02% |
| Effective Date | 89 | 38.20% | 80.90% | 44.94% | 71.91% | 47.19% | 70.79% |
| Exclusivity | 151 | 9.93% | 36.42% | 13.25% | 33.11% | 10.60% | 35.76% |
| Non-Compete | 123 | 27.64% | 55.28% | 30.08% | 57.72% | 21.95% | 41.46% |
| ROFR/ROFO/ROFN | 206 | 21.84% | 67.48% | 44.66% | 68.93% | 30.58% | 64.56% |
| Termination For Convenience | 80 | 28.75% | 76.25% | 50.00% | 81.25% | 46.25% | 76.25% |
| Weighted Average | | 27.56% | 64.58% | 36.62% | 65.17% | 32.44% | 58.81% |
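The chunking strategies in Table 1 (and the retrieval scenarios in Tables 2-4) are compared by Recall@k. A short sketch under the assumption that Recall@k is computed as a per-sample hit rate, i.e. the share of test samples whose top-k retrieved chunks contain at least one relevant chunk; the report does not spell out the exact formula:

```python
def recall_at_k(retrieved_per_sample: list[list[str]],
                relevant_per_sample: list[set[str]],
                k: int) -> float:
    """Share of samples for which at least one relevant chunk id appears
    among the top-k retrieved chunk ids (an assumed reading of Recall@k)."""
    hits = 0
    total = 0
    for retrieved, relevant in zip(retrieved_per_sample, relevant_per_sample):
        total += 1
        if any(chunk_id in relevant for chunk_id in retrieved[:k]):
            hits += 1
    return hits / total if total else 0.0
```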
Table 2. Retrieval Scenario 1 Results: Recall@10 | All CUAD Contracts | Optimised vs Non-optimised Retrieval

| Provision | # of Samples | Non-optimised Retrieval | Optimised Retrieval | Difference |
|---|---|---|---|---|
| Assignment | 654 | 89.14% | 98.62% | 9.48% |
| Audit Rights | 643 | 83.05% | 98.76% | 15.71% |
| Cap on Liability | 672 | 88.69% | 97.47% | 8.78% |
| Change of Control | 254 | 87.80% | 96.06% | 8.26% |
| Effective Date | 448 | 94.42% | 99.55% | 5.13% |
| Exclusivity | 410 | 51.95% | 95.12% | 43.17% |
| Governing Law | 464 | 99.14% | 99.78% | 0.64% |
| Insurance | 561 | 92.87% | 99.11% | 6.24% |
| Licence Grant | 777 | 63.32% | 97.30% | 33.98% |
| Minimum Commitment | 424 | 65.80% | 94.10% | 28.30% |
| Non-Compete | 260 | 71.92% | 93.08% | 21.16% |
| ROFR/ROFO/ROFN | 367 | 74.93% | 91.28% | 16.35% |
| Termination For Convenience | 246 | 94.31% | 99.19% | 4.88% |
| Warranty | 177 | 85.31% | 98.31% | 13.00% |
| Weighted Average | | 81.31% | 97.28% | 15.97% |
Table 3. Retrieval Scenario 2 Results: Recall@10 | CUAD Contracts +20 Chunks | Optimised vs Non-optimised Retrieval

| Provision | # of Samples | Non-optimised Retrieval | Optimised Retrieval | Difference |
|---|---|---|---|---|
| Assignment | 338 | 88.76% | 97.34% | 8.58% |
| Audit Rights | 472 | 80.08% | 98.31% | 18.23% |
| Cap on Liability | 375 | 80.53% | 95.47% | 14.94% |
| Change of Control | 173 | 82.08% | 94.22% | 12.14% |
| Effective Date | 157 | 90.45% | 98.73% | 8.28% |
| Exclusivity | 215 | 26.98% | 90.70% | 63.72% |
| Governing Law | 179 | 97.77% | 99.44% | 1.67% |
| Insurance | 391 | 90.54% | 98.72% | 8.18% |
| Licence Grant | 493 | 48.88% | 95.74% | 46.86% |
| Minimum Commitment | 271 | 60.89% | 90.77% | 29.88% |
| Non-Compete | 161 | 59.63% | 89.44% | 29.81% |
| ROFR/ROFO/ROFN | 265 | 66.79% | 87.92% | 21.13% |
| Termination For Convenience | 115 | 88.70% | 98.26% | 9.56% |
| Warranty | 101 | 78.22% | 97.03% | 18.81% |
| Weighted Average | | 73.15% | 95.36% | 22.21% |
Table 4. Retrieval Scenario 3 Results: Recall@20 | CUAD Contracts +40 Chunks | Optimised vs Non-optimised Retrieval

| Provision | # of Samples | Non-optimised Retrieval | Optimised Retrieval | Difference |
|---|---|---|---|---|
| Assignment | 175 | 93.71% | 99.43% | 5.72% |
| Audit Rights | 312 | 86.54% | 99.68% | 13.14% |
| Cap on Liability | 165 | 86.06% | 95.15% | 9.09% |
| Change of Control | 95 | 89.47% | 98.95% | 9.48% |
| Effective Date | 62 | 91.94% | 100.00% | 8.06% |
| Exclusivity | 117 | 53.85% | 97.44% | 43.59% |
| Governing Law | 78 | 98.72% | 98.72% | 0.00% |
| Insurance | 253 | 91.70% | 99.21% | 7.51% |
| Licence Grant | 271 | 68.27% | 98.15% | 29.88% |
| Minimum Commitment | 156 | 78.21% | 96.15% | 17.94% |
| Non-Compete | 100 | 79.00% | 96.00% | 17.00% |
| ROFR/ROFO/ROFN | 166 | 83.13% | 95.18% | 12.05% |
| Termination For Convenience | 57 | 87.72% | 100.00% | 12.28% |
| Warranty | 43 | 90.70% | 95.35% | 4.65% |
| Weighted Average | | 83.07% | 97.95% | 14.88% |
References
- Ron Raini, Mike Kennedy, Elliot White, and Kerry Westland. (2024). "The RAG Report: Large Language Models in Legal Due Diligence." Addleshaw & Goddard (AG) whitepaper.