2024 ScreensRedliningEvaluation
- (Harris, 2024b) ⇒ Evan Harris. (2024). “Screens Redlining Evaluation.”
Subject Headings: TermScout.
Notes
- Focus on Correction: The study evaluates the system’s ability to correct failed contract standards through redline suggestions, rather than just identifying them.
- High Success Rate: The Screens system achieves a 97.6% success rate at converting failed standards into passes after applying its suggested redlines.
- Automated Methodology: The evaluation involves no human review; instead, the platform re-screens each contract after redlines are applied to check whether the failed standards now pass (see the sketch after this list).
- Single Screen & Contract Type: The analysis uses one specific screen, “SaaS Savvy: Lower Value Purchases,” and 50 publicly available SaaS terms-of-service contracts for consistency and reproducibility.
- Narrow Success Metric: The only measure is whether failed standards now pass; considerations like brevity, etiquette, or counterparty acceptance are not included.
- Value of Redlines: The results show that the system is not limited to detecting issues; it can propose actionable edits that improve contract compliance.
- Complex Failures: Some standards require widespread changes throughout the contract, making them harder for the LLM to fix with a single round of edits.
- Ineffective Redlines: In a minority of cases, the suggested revision may not go far enough to turn a fail into a pass, underscoring the difficulty of drafting language that fully satisfies a standard.
- Review Errors: Occasionally, the tool may incorrectly judge the updated contract as failing even when it should pass, highlighting the remaining imperfections in LLM reasoning.
- Reproducibility & Transparency: The article provides enough detail for others to replicate the analysis, promoting transparency and trust in the evaluation process.
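The re-screening methodology summarized above can be made concrete with a short sketch. This is a hypothetical reconstruction, not the Screens implementation: the `screen` and `redline` callables, the per-standard redlining loop, and the dictionary result shape are all assumptions made for illustration, since the article reports only the aggregate outcome.

```python
# Minimal sketch of the automated evaluation loop, assuming two callables
# that stand in for the Screens platform (not a real API):
#   screen(contract)          -> {standard_id: passed} for every standard
#   redline(contract, std_id) -> revised contract with the suggested edit applied
from typing import Callable


def redline_success_rate(
    contracts: list[str],
    screen: Callable[[str], dict[str, bool]],
    redline: Callable[[str, str], str],
) -> float:
    """Fraction of initially failed standards that pass after one redline round.

    This is the evaluation's only metric: brevity, etiquette, and likely
    counterparty acceptance of the redline are deliberately out of scope.
    """
    failed_total = 0
    converted = 0
    for contract in contracts:
        initial = screen(contract)  # first pass: classify every standard
        for std_id, passed in initial.items():
            if passed:
                continue  # only failed standards receive redlines
            failed_total += 1
            revised = redline(contract, std_id)  # LLM-suggested edit, applied
            rescreened = screen(revised)         # re-screen; no human review
            if rescreened.get(std_id, False):
                converted += 1
    return converted / failed_total if failed_total else 1.0
```

Under this framing, the reported 97.6% corresponds to `converted / failed_total` across the 50 SaaS contracts; whether redlines were applied per standard or all at once is an assumption of this sketch.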
Cited By
Quotes
Abstract
Evaluating the accuracy of large language models (LLMs) on contract review tasks is critical to understanding reliability in the field. At Screens, we focus on application-specific ways to evaluate the performance of various aspects of our LLM stack. We’ve previously released an evaluation report that measures an LLM system’s ability to classify a contract as meeting or not meeting sets of substantive, well-defined standards.
Now, we turn our attention to the system’s ability to correct failed standards with suggested redlines. We find that the Screens product, which employs this system, achieves a 97.6% success rate at correcting failed standards with redlines.
References
Author | Title | Year
---|---|---
Evan Harris | Screens Redlining Evaluation | 2024