BigLaw Bench
A BigLaw Bench is a legal AI benchmark that evaluates legal AI systems on complex, real-world legal tasks.
- Context:
- It can (typically) be developed by Harvey.ai's legal research team.
- It can (typically) use tasks derived from real legal time entries, representing billable legal work performed by lawyers.
- It can (typically) evaluate Legal Answer Quality by assessing the extent to which the LLM produces a lawyer-quality work product, accounting for both positive achievements and negative errors.
- It can (typically) assess Source Reliability by measuring the LLM's ability to provide accurate and verifiable legal references to support its assertions.
- It can (typically) measure the performance of Legal AI Systems on tasks like Document Drafting, Legal Reasoning, and Risk Assessment.
- It can (often) be used to test AIs against actual billable work done by lawyers, such as Transactional Tasks and Litigation Tasks.
- It can (often) focus on assessing tasks that mirror actual billable work performed by lawyers, such as risk assessment, legal document drafting, and client advisory.
- It can (often) focus on both transactional and litigation tasks, covering different legal practice areas and the nature of legal matters.
- It can (often) be used by law firms and AI developers to fine-tune LLMs for legal tasks, improving their real-world utility and reducing errors like hallucination.
- ...
- It can range from evaluating simpler drafting tasks to assessing complex risk assessments requiring nuanced legal knowledge and accurate citations.
- ...
- It can employ custom legal rubrics to evaluate LLM performance, considering factors like task completion, professional tone, legal relevance, and common failure modes such as hallucinations.
- It can measure performance through two main metrics: Answer Quality and Source Reliability.
- It can supplement traditional legal AI benchmarks, which may focus on multiple-choice legal exams, by introducing tasks that mirror real legal work.
- It can help identify areas where LLMs excel, such as legal drafting tasks, while highlighting weaknesses in tasks requiring deep legal reasoning and accurate sourcing.
- It can include a variety of task types within transactional and litigation domains, offering a comprehensive assessment of LLM capabilities in legal practice.
- It can provide insights into the capacity of public LLMs to generate high-quality legal content while struggling to provide reliable sourcing.
- It can help benchmark models for specific legal domains, such as Corporate Law, Intellectual Property Law, and Contract Law.
- It can utilize metrics like the Answer Score and Source Score to measure the completeness and verifiability of a model's outputs in comparison to lawyer-quality work (see the scoring sketch after this context list).
- ...
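The answer-scoring idea described above can be illustrated with a minimal sketch. The rubric schema, point values, and function names below are hypothetical illustrations, not Harvey's actual rubrics; they only show how positive achievements and negative errors (e.g., hallucinations) might combine into an answer score expressed as a fraction of a lawyer-quality work product.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    """One criterion in a task-specific rubric (hypothetical schema)."""
    description: str
    points: float          # positive points available for meeting the criterion
    met: bool = False      # whether the model output satisfied it

@dataclass
class Penalty:
    """A negative scoring event, e.g. a hallucinated authority (hypothetical)."""
    description: str
    points: float          # points deducted

def answer_score(items: list[RubricItem], penalties: list[Penalty]) -> float:
    """Fraction of a lawyer-quality work product the model completed.

    Positive credit for rubric items met, minus deductions for errors,
    normalized by the total points a lawyer-quality answer would earn.
    """
    total_possible = sum(i.points for i in items)
    earned = sum(i.points for i in items if i.met)
    deducted = sum(p.points for p in penalties)
    return max(0.0, (earned - deducted) / total_possible)

# Example: a drafting task worth 10 rubric points, with one hallucination penalty.
items = [
    RubricItem("Covers required confidentiality obligations", 4, met=True),
    RubricItem("Uses professional tone appropriate for the client", 2, met=True),
    RubricItem("Addresses term and survival of obligations", 4, met=False),
]
penalties = [Penalty("Cites a non-existent statute", 1)]
print(f"Answer score: {answer_score(items, penalties):.0%}")  # Answer score: 50%
```

Normalizing by the total rubric points expresses the result as the share of a lawyer-quality work product completed, which is how the final answer score is described in the references below.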
- Example(s):
- a drafting task, such as drafting a confidentiality clause for a merger agreement, evaluated on completeness, legal accuracy, and appropriateness for the client's needs.
- a risk-assessment task, such as assessing the risks of a client's proposed business action, requiring identification of relevant laws and regulations with proper citations.
- ...
- Counter-Example(s):
- LegalBench, a comprehensive benchmark featuring tasks across legal domains, testing the ability of LLMs to perform IRAC-style reasoning and practical legal applications like contract clause identification.
- CUAD (Contract Understanding Atticus Dataset), a benchmark focused on AI's ability to classify and extract legal clauses from contracts, important for transactional law.
- Rule QA Task, a subtask within LegalBench that evaluates how accurately LLMs answer questions about specific legal rules.
- See: Legal AI Systems, Transactional Tasks, Litigation Tasks, Legal Research Tools
References
2024
- https://www.harvey.ai/blog/introducing-biglaw-bench
- NOTES:
- BigLaw Bench is a framework for evaluating large language models (LLMs) on complex legal tasks using real-world examples.
- The tasks used in BigLaw Bench are derived from time entries, which represent billable legal work performed by lawyers, covering tasks like risk assessment, drafting documents, and client advisory.
- Existing benchmarks like multiple-choice tests are insufficient to evaluate the complex legal tasks that lawyers perform.
- BigLaw Bench focuses on litigation and transactional tasks, reflecting different practice areas and the nature of legal matters.
- Custom rubrics are used to evaluate the LLMs' performance, considering factors like task completion, tone, relevance, and common failure modes (e.g., hallucinations).
- Answer score measures how much of a lawyer-quality work product the LLM completes, considering both positive achievements and negative errors.
- Source score measures the LLM's ability to provide accurate references to support its assertions, ensuring traceability and verifiability (see the source-verification sketch after these notes).
- Public LLMs often perform well in generating content but struggle with providing accurate sourcing, leading to lower source scores.
- Transactional tasks generally see better model performance, as they are more analytical, whereas litigation tasks require ideation and argumentation, areas where LLMs underperform.
- Foundation models tend to hallucinate sources when explicitly asked to provide references, leading to lower accuracy in sourcing.
- The evaluation methodology includes a combination of positive and negative scoring, which highlights both the strengths and weaknesses of LLMs in real-world legal tasks.
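A minimal sketch of the source-score idea, assuming a hypothetical list of cited authorities checked against a set of verifiable sources; the function name, the verification step, and the example citations are illustrative and are not the benchmark's actual pipeline.

```python
def source_score(citations: list[str], verifiable: set[str]) -> float:
    """Share of cited authorities that can actually be verified.

    A hallucinated or unverifiable citation earns no credit, so a model
    that invents sources scores low even if its prose reads well.
    """
    if not citations:
        return 0.0
    verified = sum(1 for c in citations if c in verifiable)
    return verified / len(citations)

# Example: two of three cited authorities check out (made-up citations).
cited = ["Smith v. Jones (2019)", "17 U.S.C. § 107", "Doe v. Roe (2031)"]
known_good = {"Smith v. Jones (2019)", "17 U.S.C. § 107"}
print(f"Source score: {source_score(cited, known_good):.0%}")  # Source score: 67%
```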
2024
- https://github.com/harveyai/biglaw-bench
- NOTES:
- BigLaw Bench is a comprehensive framework for evaluating large language models (LLMs) on complex, real-world legal tasks.
- Developed by Harvey's legal research team, it aims to supplement existing benchmarks by focusing on tasks that mirror actual billable work lawyers perform.
- Tasks are organized into two primary categories: Transactional Task Categories and Litigation Task Categories, each with several specific task types (see the per-category aggregation sketch after these notes).
- The evaluation methodology uses custom-designed rubrics that measure Answer Quality (completeness, accuracy, appropriateness) and Source Reliability (verifiable and correctly cited sources).
- Scores are calculated by combining positive points for meeting task requirements and negative points for errors or missteps, such as hallucinations.
- The final answer score represents the percentage of a lawyer-quality work product the LLM completes.
- A data sample is available for preview, and full access to the dataset can be obtained by contacting Harvey directly.
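The category split in these notes could be summarized with a simple aggregation, sketched below under the assumption that each scored task carries a category label; the function, the result-dict fields, and the scores are hypothetical placeholders, not data from the benchmark.

```python
from collections import defaultdict

def mean_scores_by_category(results: list[dict]) -> dict[str, float]:
    """Average answer score per task category (hypothetical result format)."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for r in results:
        buckets[r["category"]].append(r["answer_score"])
    return {cat: round(sum(v) / len(v), 2) for cat, v in buckets.items()}

# Example with made-up scores, illustrating a transactional/litigation split.
results = [
    {"category": "transactional", "answer_score": 0.74},
    {"category": "transactional", "answer_score": 0.68},
    {"category": "litigation", "answer_score": 0.55},
]
print(mean_scores_by_category(results))
# {'transactional': 0.71, 'litigation': 0.55}
```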