2024 AddressingAnnotatedDataScarcity
- (Zin et al., 2024) ⇒ May Myo Zin, Ha Thanh Nguyen, Ken Satoh, and Fumihito Nishino. (2024). “Addressing Annotated Data Scarcity in Legal Information Extraction.” In: New Frontiers in Artificial Intelligence. ISBN:978-981-97-3076-6 doi:10.1007/978-981-97-3076-6_6
Subject Headings: Legal-Domain NER.
Notes
- It investigates the use of Generative Pre-trained Transformers (GPT) to address the challenge of annotated data scarcity in legal Named Entity Recognition (NER).
- It focuses on NER in the context of contractual legal cases, specifically those involving sale and purchase agreements, aiming to identify and extract entities such as SELLER, BUYER, CONTRACT_NAME, PURCHASE_PRODUCT, PURCHASE_PRICE, and PURCHASE_DATE.
- It explores the effectiveness of GPT models in generating human-like annotated data to overcome the limitations of manual annotation, which is labor-intensive, time-consuming, and expensive.
- It employs prompt engineering to guide GPT models (GPT-3 and GPT-4) in generating diverse, coherent annotated data, using carefully crafted prompts that specify the target entity categories and annotation format and that encourage variety in writing styles, entity values, and scenarios (a prompt sketch follows this list).
- It compares the performance of a BERT-based NER model fine-tuned on GPT-generated data with that of models trained on human-created data and with the direct application of GPT models for zero-shot entity extraction.
- It demonstrates that the BERT-based NER model fine-tuned on GPT-3-generated data outperforms the model fine-tuned on human-created data, highlighting the potential of GPT models for automatic data creation and annotation (a fine-tuning sketch also follows this list).
- It evaluates the generalization capabilities of the NER models using various test sets, including seen and unseen scenarios, to assess their performance in real-world applications.
- It discusses the study's limitations, such as its focus on a single type of contractual legal case and the financial constraints on running large-scale experiments, and it outlines future work to expand the dataset and to explore error detection and correction methods for GPT-generated data.
- It contributes to the broader goal of enhancing the accuracy and efficiency of information extraction in the legal domain, showcasing the transformative impact of advanced language models on addressing data scarcity issues and democratizing legal AI.
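The prompt-engineering step can be pictured as follows. This is a minimal sketch, assuming the OpenAI Python client (v1) and a JSON output schema; the model name, prompt wording, and output format are illustrative choices, not the paper's exact prompts.

```python
# Minimal sketch: ask a GPT model to generate annotated sale-and-purchase
# sentences in a fixed format. Assumes the OpenAI Python client (v1);
# prompt wording, model name, and JSON schema are illustrative, not the
# paper's exact setup.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ENTITY_TYPES = ["SELLER", "BUYER", "CONTRACT_NAME",
                "PURCHASE_PRODUCT", "PURCHASE_PRICE", "PURCHASE_DATE"]

PROMPT = f"""Generate 5 sentences describing sale and purchase agreements.
Vary the writing style, entity values, and scenario in each sentence.
Return a JSON list; each item must have:
  "text": the sentence,
  "entities": a list of {{"label": one of {ENTITY_TYPES}, "span": exact substring}}.
Return only the JSON."""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": PROMPT}],
    temperature=1.0,  # a higher temperature encourages diverse outputs
)

examples = json.loads(response.choices[0].message.content)
for ex in examples:
    print(ex["text"], ex["entities"])
```

Requesting exact substrings as spans makes it straightforward to convert the generated examples into token-level BIO labels for NER training.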
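The downstream step, fine-tuning a BERT-based NER model on the generated data, could look like the sketch below. It assumes the Hugging Face transformers and datasets libraries; the entity labels follow the paper, while the toy in-memory example, checkpoint name, and hyperparameters are illustrative stand-ins for the GPT-generated corpus and the paper's training setup.

```python
# Minimal sketch: fine-tune a BERT token-classification model on generated
# NER data. Assumes Hugging Face transformers/datasets; data and
# hyperparameters are illustrative placeholders.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer,
                          DataCollatorForTokenClassification)

ENTITY_TYPES = ["SELLER", "BUYER", "CONTRACT_NAME",
                "PURCHASE_PRODUCT", "PURCHASE_PRICE", "PURCHASE_DATE"]
labels = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")]
label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for l, i in label2id.items()}

# Toy example standing in for the GPT-generated corpus.
train_examples = {
    "tokens": [["Acme", "Corp", "sells", "500", "laptops", "to", "Globex"]],
    "ner_tags": [["B-SELLER", "I-SELLER", "O", "O",
                  "B-PURCHASE_PRODUCT", "O", "B-BUYER"]],
}

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_and_align(batch):
    enc = tokenizer(batch["tokens"], truncation=True,
                    is_split_into_words=True)
    all_labels = []
    for i, word_labels in enumerate(batch["ner_tags"]):
        word_ids = enc.word_ids(batch_index=i)
        prev, ids = None, []
        for wid in word_ids:
            if wid is None:
                ids.append(-100)             # ignore special tokens
            elif wid != prev:
                ids.append(label2id[word_labels[wid]])
            else:
                ids.append(-100)             # label only the first subword
            prev = wid
        all_labels.append(ids)
    enc["labels"] = all_labels
    return enc

train_ds = Dataset.from_dict(train_examples).map(
    tokenize_and_align, batched=True,
    remove_columns=["tokens", "ner_tags"])

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels),
    id2label=id2label, label2id=label2id)

args = TrainingArguments(output_dir="ner-gpt-data", num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  data_collator=DataCollatorForTokenClassification(tokenizer))
trainer.train()
```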
Cited By
Quotes
Abstract
Named Entity Recognition (NER) models face unique challenges in the field of legal text analysis, primarily due to the scarcity of annotated legal data. The creation of a diverse and representative legal text corpus is hindered by the labor-intensive, time-consuming, and expensive nature of manual annotation, leading to suboptimal model performance when trained on insufficient or biased data. This study explores the effectiveness of Generative Pre-trained Transformers (GPT) in overcoming these challenges. Leveraging the generative capabilities of GPT models, we use them as tools for creating human-like annotated data. Through experiments, our research reveals that the pre-trained BERT model, when fine-tuned on GPT-3 generated data, surpasses its counterpart fine-tuned on human-created data in the legal NER task. The demonstrated success of this methodology underscores the potential of large language models (LLMs) in advancing the development of more reliable and contextually aware Legal NER systems for intricate legal texts. This work contributes to the broader goal of enhancing the accuracy and efficiency of information extraction in the legal domain, showcasing the transformative impact of advanced language models on addressing data scarcity issues.
References
| | Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2024 AddressingAnnotatedDataScarcity | May Myo Zin; Ha Thanh Nguyen; Ken Satoh; Fumihito Nishino | | 2024 | Addressing Annotated Data Scarcity in Legal Information Extraction | | New Frontiers in Artificial Intelligence | | 10.1007/978-981-97-3076-6_6 | | 2024 |