2024 ScreensAccuracyEvaluationReport
- (Harris, 2024a) ⇒ Evan Harris. (2024). “Screens Accuracy Evaluation Report.”
Subject Headings: TermScout.
Notes
- Core Evaluation Focus: The study uses binary pass/fail standards to objectively measure the accuracy of LLM-based contract review; a minimal tallying sketch appears after this list.
- Data Composition: The evaluation involves a test set of 720 decisions, 51 contracts, and 132 unique standards drawn from multiple contract types.
- High Accuracy Achieved: Screens’ production system reports a 97.5% accuracy rate, demonstrating that careful methodology and prompt design can yield near-human reliability.
- Guidance Matters: Introducing explicit AI Guidance (detailed instructions and clarifications) improves accuracy dramatically, reducing errors by about 75%; see the worked numbers after this list.
- Proprietary Enhancements: Screens’ own retrieval and reasoning optimizations reduce errors by about 65%, showing the value of tailored techniques beyond off-the-shelf RAG systems.
- LLM Comparisons: GPT-4 variants outperform Claude 3 and GPT-3.5, suggesting that the latest generations of LLMs are far more capable at contract review tasks.
- Error Types: The analysis identifies four error categories: retrieval errors (missing cross-references), parsing errors (upstream text extraction issues), reasoning errors (faulty logic), and guidance errors (ambiguous or poorly defined standards).
- Importance of Good Standards: Well-defined standards are crucial. Inadequate definition, vague exceptions, or unstated assumptions lead to guidance errors and incorrect results.
- Foundational Building Block: The binary evaluation of standards is positioned as a fundamental layer for more complex tasks like redlining, negotiation, and due diligence.
- Transparency & Reproducibility: The article emphasizes transparency and shares community screens so that readers can run similar evaluations themselves, fostering trust and enabling independent replication.
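
The binary framing makes accuracy a simple ratio of correct pass/fail verdicts to total decisions, with each miss assignable to one of the four error categories above. The sketch below is a minimal illustration of such a tally; it is not code from the report, and the record layout, field names, and example values are assumptions.

```python
# Minimal sketch (hypothetical data, not from the report): tally binary
# pass/fail verdicts and bucket misses into the four error categories
# the report describes: retrieval, parsing, reasoning, and guidance.
from collections import Counter

# One record per (contract, standard) verdict; the report's test set has 720.
decisions = [
    {"contract": "MSA-007", "standard": "auto-renewal disclosed", "correct": True, "error_type": None},
    {"contract": "NDA-019", "standard": "governing law specified", "correct": False, "error_type": "retrieval"},
    {"contract": "DPA-002", "standard": "breach notice within 72h", "correct": False, "error_type": "guidance"},
]

total = len(decisions)
correct = sum(1 for d in decisions if d["correct"])
accuracy = correct / total

# Breakdown of misses by error category.
errors_by_type = Counter(d["error_type"] for d in decisions if not d["correct"])

print(f"Accuracy: {accuracy:.1%} ({correct}/{total} decisions)")
for error_type, count in errors_by_type.most_common():
    print(f"  {error_type} errors: {count}")
```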
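
To make the reported error reductions concrete, the arithmetic below shows how a roughly 75% cut in errors maps onto error and accuracy rates. The 10% baseline error rate is an illustrative assumption only; 97.5% is the production accuracy the report states.

```python
# Hedged arithmetic: the baseline error rate is assumed for illustration.
baseline_error_rate = 0.10        # assumed error rate without explicit AI Guidance
guidance_reduction = 0.75         # ~75% fewer errors with explicit guidance

with_guidance = baseline_error_rate * (1 - guidance_reduction)
print(f"Error rate: {baseline_error_rate:.1%} -> {with_guidance:.1%}")          # 10.0% -> 2.5%
print(f"Accuracy:   {1 - baseline_error_rate:.1%} -> {1 - with_guidance:.1%}")  # 90.0% -> 97.5%
```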
Cited By
Quotes
Abstract
Evaluating the accuracy of large language models (LLMs) on contract review tasks is critical to understanding reliability in the field. However, objectivity is a challenge when evaluating long form, free text responses to prompts. We present an evaluation methodology that measures an LLM system’s ability to classify a contract as meeting or not meeting sets of substantive, well-defined standards. This approach serves as a foundational step for various use cases, including playbook execution, workflow routing, negotiation, redlining, summarization, due diligence, and more. We find that the Screens product, which employs this system, achieves a 97.5% accuracy rate. Additionally, we explore how different LLMs and methods impact AI accuracy.
References
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2024 ScreensAccuracyEvaluationReport | Evan Harris | | | Screens Accuracy Evaluation Report | | | | | | 2024 |