2024 ScreensAccuracyEvaluationReport
- (Harris, 2024a) ⇒ Evan Harris. (2024). “Screens Accuracy Evaluation Report.”
Subject Headings: TermScout.
Notes
- Core Evaluation Focus: The study uses binary pass/fail standards to objectively measure the accuracy of LLM-based contract review; a minimal tallying sketch appears after this list.
- Data Composition: The evaluation involves a test set of 720 decisions, 51 contracts, and 132 unique standards drawn from multiple contract types.
- High Accuracy Achieved: Screens’ production system reports a 97.5% accuracy rate, demonstrating that careful methodology and prompt design can yield near-human reliability.
- Guidance Matters: Introducing explicit AI Guidance (detailed instructions and clarifications) improves accuracy dramatically, reducing errors by about 75%; see the worked numbers after this list.
- Proprietary Enhancements: Screens’ own retrieval and reasoning optimizations reduce errors by about 65%, showing the value of tailored techniques beyond off-the-shelf RAG systems.
- LLM Comparisons: GPT-4 variants outperform Claude 3 and GPT-3.5, suggesting that the latest generations of LLMs are far more capable at contract review tasks.
- Error Types: The analysis identifies four error categories: retrieval errors (missing cross-references), parsing errors (upstream text extraction issues), reasoning errors (faulty logic), and guidance errors (ambiguous or poorly defined standards).
- Importance of Good Standards: Well-defined standards are crucial. Inadequate definition, vague exceptions, or unstated assumptions lead to guidance errors and incorrect results.
- Foundational Building Block: The binary evaluation of standards is positioned as a fundamental layer for more complex tasks like redlining, negotiation, and due diligence.
- Transparency & Reproducibility: The article emphasizes transparency and shares community screens so that readers can run similar evaluations themselves, fostering trust and enabling independent replication.
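
The binary framing makes accuracy a simple ratio of correct pass/fail verdicts to total decisions, with each miss assignable to one of the four error categories above. The sketch below is a minimal illustration of such a tally; it is not code from the report, and the record layout, field names, and example values are assumptions.

```python
# Minimal sketch (hypothetical data, not from the report): tally binary
# pass/fail verdicts and bucket misses into the four error categories
# the report describes: retrieval, parsing, reasoning, and guidance.
from collections import Counter

# One record per (contract, standard) verdict; the report's test set has 720.
decisions = [
    {"contract": "MSA-007", "standard": "auto-renewal disclosed", "correct": True, "error_type": None},
    {"contract": "NDA-019", "standard": "governing law specified", "correct": False, "error_type": "retrieval"},
    {"contract": "DPA-002", "standard": "breach notice within 72h", "correct": False, "error_type": "guidance"},
]

total = len(decisions)
correct = sum(1 for d in decisions if d["correct"])
accuracy = correct / total

# Breakdown of misses by error category.
errors_by_type = Counter(d["error_type"] for d in decisions if not d["correct"])

print(f"Accuracy: {accuracy:.1%} ({correct}/{total} decisions)")
for error_type, count in errors_by_type.most_common():
    print(f"  {error_type} errors: {count}")
```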
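
To make the reported error reductions concrete, the arithmetic below shows how a roughly 75% cut in errors maps onto error and accuracy rates. The 10% baseline error rate is an illustrative assumption only; 97.5% is the production accuracy the report states.

```python
# Hedged arithmetic: the baseline error rate is assumed for illustration.
baseline_error_rate = 0.10        # assumed error rate without explicit AI Guidance
guidance_reduction = 0.75         # ~75% fewer errors with explicit guidance

with_guidance = baseline_error_rate * (1 - guidance_reduction)
print(f"Error rate: {baseline_error_rate:.1%} -> {with_guidance:.1%}")          # 10.0% -> 2.5%
print(f"Accuracy:   {1 - baseline_error_rate:.1%} -> {1 - with_guidance:.1%}")  # 90.0% -> 97.5%
```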
Cited By
Quotes
Abstract
Evaluating the accuracy of large language models (LLMs) on contract review tasks is critical to understanding reliability in the field. However, objectivity is a challenge when evaluating long form, free text responses to prompts. We present an evaluation methodology that measures an LLM system’s ability to classify a contract as meeting or not meeting sets of substantive, well-defined standards. This approach serves as a foundational step for various use cases, including playbook execution, workflow routing, negotiation, redlining, summarization, due diligence, and more. We find that the Screens product, which employs this system, achieves a 97.5% accuracy rate. Additionally, we explore how different LLMs and methods impact AI accuracy.
References
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2024 ScreensAccuracyEvaluationReport | Evan Harris | | | Screens Accuracy Evaluation Report | | | | | | 2024 |