2024 ScreensRedliningEvaluation


Subject Headings: TermScout.


Cited By



Evaluating the accuracy of large language models (LLMs) on contract review tasks is critical to understanding reliability in the field. At Screens, we focus on application-specific ways to evaluate the performance of various aspects of our LLM stack. We’ve previously released an evaluation report that measures an LLM system’s ability to classify a contract as meeting or not meeting sets of substantive, well-defined standards.

Now, we turn our attention to the system’s ability to correct failed standards with suggested redlines. We find that the Screens product, which employs this system, achieves a 97.6% success rate at correcting failed standards with redlines.
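The success rate quoted above can be read as the share of failed standards that are corrected once the suggested redline is applied. The excerpt does not say how a correction is judged, so the following is only a minimal sketch under assumptions: each trial is one failed standard, and a hypothetical `passes_after_redline` flag records whether the contract meets that standard after the redline. The `RedlineTrial` type, the function name, and the toy data are illustrative and are not taken from the Screens report.

```python
from dataclasses import dataclass


@dataclass
class RedlineTrial:
    """One failed standard and the outcome of its suggested redline (hypothetical schema)."""
    standard_id: str
    passes_after_redline: bool  # assumed: result of re-checking the standard after the redline


def redline_success_rate(trials):
    """Return the fraction of failed standards that were corrected by a redline."""
    if not trials:
        raise ValueError("at least one trial is required")
    corrected = sum(1 for t in trials if t.passes_after_redline)
    return corrected / len(trials)


# Toy data for illustration only; these are not results from the evaluation.
trials = [
    RedlineTrial("s1", True),
    RedlineTrial("s2", True),
    RedlineTrial("s3", False),
    RedlineTrial("s4", True),
]
print(f"{redline_success_rate(trials):.1%}")
```

Under this reading, the denominator counts only standards that failed before redlining, so the metric is separate from the classification accuracy measured in the earlier report.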



| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
| 2024 ScreensRedliningEvaluation | Evan Harris | | | Screens Redlining Evaluation | | | | | | 2024 |