2024 LexEvalAComprehensiveChineseLeg
- (Li, Chen et al., 2024) ⇒ Haitao Li, You Chen, Qingyao Ai, Yueyue Wu, Ruizhe Zhang, and Yiqun Liu. (2024). “LexEval: A Comprehensive Chinese Legal Benchmark for Evaluating Large Language Models.” In: arXiv preprint arXiv:2409.20288.
Subject Headings: LexEval, Legal Benchmark, Legal Cognitive Ability Taxonomy (LexAbility), Legal Memorization Task, Legal Understanding Task, Legal Logic Inference Task, Legal Discrimination Task, Legal Generation Task, and Legal Ethics Task.
Notes
- The paper introduces LexEval, the largest Chinese legal benchmark for evaluating large language models, comprising 23 legal tasks and 14,150 evaluation questions.
- The paper proposes a novel Legal Cognitive Ability Taxonomy (LexAbility) that organizes legal tasks along six ability dimensions: Memorization, Understanding, Logic Inference, Discrimination, Generation, and Ethics (see the illustrative scoring sketch after these notes).
- The paper reveals that general-purpose large language models like GPT-4 outperform legal-specific models, but still struggle with specific Chinese legal knowledge.
- The paper demonstrates that increasing model size generally improves performance in legal tasks, as evidenced by the comparison between Qwen-14B and Qwen-7B.
- The paper highlights a significant performance gap in large language models for tasks requiring memorization of legal facts and ethical judgment.
- The paper identifies strengths in large language models for Understanding and Logic Inference within the legal domain.
- The paper exposes limitations in current large language models for Discrimination and Generation in legal applications.
- The paper emphasizes the need for specialized training on Chinese legal knowledge to improve the performance of large language models on legal tasks.
- The paper underscores the importance of enhancing ethical reasoning capabilities in large language models for legal contexts.
- The paper suggests that continuous pre-training on legal corpora alone is insufficient for developing effective legal-specific large language models.
- The paper advocates for human-AI collaboration in legal practice, emphasizing that large language models should assist rather than replace legal professionals.
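Because LexEval reports results along the six LexAbility dimensions, a per-dimension accuracy breakdown is the natural way to summarize a model's scores on the objective (multiple-choice) tasks. The following is a minimal, hypothetical Python sketch of such an aggregation; the record fields and the `query_model` stub are illustrative assumptions, not the actual LexEval data schema or evaluation harness.

```python
from collections import defaultdict

# Hypothetical record format: each item carries a LexAbility dimension label,
# a multiple-choice question, its options, and a gold answer key ("A"-"D").
# These field names are assumptions for illustration only.
SAMPLE_ITEMS = [
    {"dimension": "Memorization", "question": "...", "options": ["A", "B", "C", "D"], "answer": "B"},
    {"dimension": "Ethics", "question": "...", "options": ["A", "B", "C", "D"], "answer": "D"},
]

def query_model(question: str, options: list[str]) -> str:
    """Placeholder for an LLM call that returns a single option letter."""
    return "A"

def per_dimension_accuracy(items):
    """Aggregate exact-match accuracy separately for each ability dimension."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        prediction = query_model(item["question"], item["options"])
        total[item["dimension"]] += 1
        if prediction == item["answer"]:
            correct[item["dimension"]] += 1
    return {dim: correct[dim] / total[dim] for dim in total}

if __name__ == "__main__":
    print(per_dimension_accuracy(SAMPLE_ITEMS))
```

Generation-style tasks would require reference-based metrics rather than exact-match accuracy, so a sketch like this would cover only the objective, multiple-choice portion of the benchmark.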
Cited By
Quotes
Abstract
Large language models (LLMs) have made significant progress in natural language processing tasks and demonstrate considerable potential in the legal domain. However, legal applications demand high standards of accuracy, reliability, and fairness. Applying existing LLMs to legal systems without careful evaluation of their potential and limitations could pose significant risks in legal practice. To this end, we introduce a standardized comprehensive Chinese legal benchmark LexEval. This benchmark is notable in the following three aspects: (1) Ability Modeling: We propose a new taxonomy of legal cognitive abilities to organize different tasks. (2) Scale: To our knowledge, LexEval is currently the largest Chinese legal evaluation dataset, comprising 23 tasks and 14,150 questions. (3) Data: we utilize formatted existing datasets, exam datasets and newly annotated datasets by legal experts to comprehensively evaluate the various capabilities of LLMs. LexEval not only focuses on the ability of LLMs to apply fundamental legal knowledge but also dedicates efforts to examining the ethical issues involved in their application. We evaluated 38 open-source and commercial LLMs and obtained some interesting findings. The experiments and findings offer valuable insights into the challenges and potential solutions for developing Chinese legal systems and LLM evaluation pipelines. The LexEval dataset and leaderboard are publicly available at \url{this https URL} and will be continuously updated.
References
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2024 LexEvalAComprehensiveChineseLeg | You Chen, Haitao Li, Qingyao Ai, Yueyue Wu, Ruizhe Zhang, Yiqun Liu | | | LexEval: A Comprehensive Chinese Legal Benchmark for Evaluating Large Language Models | | | | | | 2024 |