2020 LEDGARALargeScaleMultiLabelCorp

(Tuggener et al., 2020) ⇒ Don Tuggener, Pius Von Däniken, Thomas Peetz, and Mark Cieliebak. (2020). “LEDGAR: A Large-scale Multi-label Corpus for Text Classification of Legal Provisions in Contracts.” In: Proceedings of the Twelfth Language Resources and Evaluation Conference.

Subject Headings: EDGAR DB, LEDGAR.

Notes

Extensive Label Set: LEDGAR includes a set of over 12,000 labels used for the classification of legal provisions in contracts. This extensive label set supports large-scale or extreme text classification within the legal domain.
Multi-label Classification: The dataset supports multi-label classification, where each provision in a contract can be associated with multiple labels. This feature reflects the complexity and multiple applicable legal considerations for each provision.
Label Hierarchy and Decomposition: The dataset includes a latent hierarchy within the labels, where longer label names often subsume shorter ones. This hierarchical structure has been used to decompose longer labels into shorter, more manageable labels for classification purposes.
Automatic Label Extraction: Labels in the LEDGAR dataset were extracted automatically from legal documents sourced from SEC filings. These labels were originally assigned by the contract creators and reflect the legal terminologies and considerations relevant at the time of drafting.
Provision and Label Statistics: The dataset contains provisions labeled at the paragraph level, with significant variability in terms of label frequency and provision length. The paper reports that some labels appear very frequently, while others are rare, affecting the distribution and classification performance.
In the context of the LEDGAR dataset, a "provision" is a segment of contract text that specifies a particular clause or condition. Each provision addresses specific aspects of the contract, such as obligations, rights, procedures, or restrictions applicable to the contracting parties. Provisions are typically structured as paragraphs or distinct sections within a legal document and are labeled to facilitate automated classification and analysis based on their content. The dataset treats each provision as an individual unit of text to which labels are applied; the labels represent the legal concepts or actions discussed within the provision. This approach enables detailed analysis and classification of legal documents by segmenting them into manageable and contextually significant parts.
"Contractual provisions are a primary research target in law studies, as they are the essential discourse units when drafting, negotiating, validating, or analyzing contracts. Each provision in a contract constitutes a legal speech act, and the concatenation of the provisions comprises the legal essence of a contract."

Cited By

http://scholar.google.com/scholar?q=%222020%22+LEDGAR%3A+A+Large-scale+Multi-label+Corpus+for+Text+Classification+of+Legal+Provisions+in+Contracts

Quotes

Abstract

We present LEDGAR, a multilabel corpus of legal provisions in contracts. The corpus was crawled and scraped from the public domain (SEC filings) and is, to the best of our knowledge, the first freely available corpus of its kind. Since the corpus was constructed semi-automatically, we apply and discuss various approaches to noise removal. Due to the rather large labelset of over 12,000 labels annotated in almost 100,000 provisions in over 60,000 contracts, we believe the corpus to be of interest for research in the field of Legal NLP, (large-scale or extreme) text classification, as well as for legal studies. We discuss several methods to sample subcorpora from the corpus and implement and evaluate different automatic classification approaches. Finally, we perform transfer experiments to evaluate how well the classifiers perform on contracts stemming from outside the corpus.

1. Introduction

Legal Natural Language Processing is an emerging field that investigates the application of Natural Language Processing (NLP) techniques to the legal domain. Several dedicated conferences and workshops have emerged in recent years (Aletras et al., 2019; Rehm et al., 2018; Palmirani, 2018; Keppens and Governatori, 2017). As a relatively young discipline, Legal NLP lacks resources in several areas. We aim to address this lack regarding text classification by presenting LEDGAR (Labeled EDGAR), a corpus of labeled provisions in contracts.

Contractual provisions are a primary research target in law studies (Mills, 2019; Knapp et al., 2019; Hershkoff and Kahan, 2018; Fisch, 2018, inter alia), as they are the essential discourse units when drafting, negotiating, validating, or analysing contracts. Each provision in a contract constitutes a legal speech act (Yovel, 2000; Trosborg, 1995; Trosborg, 1991), and the concatenation of the provisions comprises the legal essence of a contract. Meanwhile, an emerging line of research investigates text classification with large-scale or extreme labelsets, where labelsets consist of thousands or even millions of labels (Bengio et al., 2019; Choromanska and Kumar Jain, 2019; Soni et al., 2018, e.g.).

In this light, our main contributions are:

We provide a freely available, substantially-sized corpus (100m+ tokens) that assists the study of provisions, both from a legal and an NLP perspective.
The corpus features a large labelset (12k+ labels), which makes it attractive to research in the field of large-scale (or extreme) multilabel and multiclass classification.
Additionally, we discuss several subsampling techniques and present a method for extracting a label hierarchy based on label names. This hierarchy enables us to reduce the initially large labelset to a more feasible size in a standard text classification setting.
Finally, we demonstrate the corpus’ suitability for training classifiers that are applicable to out-of-domain contracts, i.e. contracts that stem from outside the corpus.
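
The excerpt above does not spell out how the label hierarchy is derived from label names. As a rough sketch only, one could assume that a shorter label is a parent of any longer label whose name contains it as a contiguous run of words, and then replace each long label by the shorter labels it contains. The function names (build_hierarchy, decompose), the whitespace tokenization, and the example labels are hypothetical and are not taken from the paper or its codebase.

```python
from collections import defaultdict


def tokenize(label):
    """Lowercase a label name and split it into a tuple of words."""
    return tuple(label.lower().split())


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous word run strictly inside `longer`."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def build_hierarchy(labels):
    """Map each label to the shorter labels whose names it contains (assumed rule)."""
    tokens = {label: tokenize(label) for label in labels}
    parents = defaultdict(list)
    # Quadratic in the number of labels; acceptable for a sketch.
    for long_label, long_toks in tokens.items():
        for short_label, short_toks in tokens.items():
            if contains(long_toks, short_toks):
                parents[long_label].append(short_label)
    return dict(parents)


def decompose(label, hierarchy):
    """Replace a label by the shorter labels it contains, or keep it unchanged."""
    return sorted(hierarchy.get(label, [label]))


# Hypothetical label names, for illustration only.
EXAMPLE_LABELS = ["governing law", "jurisdiction", "governing law and jurisdiction"]
EXAMPLE_HIERARCHY = build_hierarchy(EXAMPLE_LABELS)
```

Under this assumed rule, a provision labeled "governing law and jurisdiction" would be relabeled with the two shorter labels, which is one way a large labelset could shrink to a more feasible size.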

2. Related work

We focus on related work in two areas, namely work on extracting corpora from EDGAR or providing structured access to information contained in EDGAR, and research on text classification in the legal domain.

Bommarito et al. (2018b) present an open-source framework to extract and store data from EDGAR in a relational database, effectively making working with EDGAR (e.g. finding Exhibit 10 filings) more convenient. They also include example use cases, such as training word embeddings on press releases contained in EDGAR. Bommarito et al. (2018a) released an open-source software library for NLP in the legal domain. The library offers support of basic NLP tasks, such as text segmentation, POS tagging, or named entity recognition. Furthermore, it offers heuristics for the extraction and normalization of addresses, amounts, durations, etc. Additionally, the library contains a broad variety of word and document embeddings pre-trained on different legal data (contract types, provision types etc.), and features a binary, machine learning-based classifier to distinguish contracts from other documents.

Chalkidis et al. (2019) investigate large-scale multi-label document classification in the legal domain by predicting concepts occurring in EU legislative documents. They compile a corpus of 57K documents annotated with 4.3K concepts. They find that BERT (Devlin et al., 2018) produces the best classification Micro F1 scores in almost all settings in comparison to Logistic Regression and several neural classifiers.

Waltl et al. (2019) explore automatic classification of semantic types of legal norms in German laws. They identify, label, and classify nine functional categories (e.g. duty, prohibition, permission) in sentences in the German tenancy law. They transform the texts into TF-IDF representations and found that an SVM classifier performed best in their setting. Waltl et al. (2017) also explore a combination of rule-based classification with active machine learning on this data. Glaser et al. (2018) follow up on this work and explore whether these annotated data are suitable to train classifiers that predict functional categories in rental agreements in a series of experiments similar to ours.

To the best of our knowledge, no prior work is available on provision classification in contracts or on corpora containing provisions labeled in a topical (i.e. not

...

B. Headings. Section and Subsection headings in this Amendment are included herein for convenience of reference only and shall not constitute a part of this Amendment for any other purpose or be given any substantive effect.

C. Applicable Law. THIS AMENDMENT AND THE RIGHTS AND OBLIGATIONS OF THE PARTIES HEREUNDER SHALL BE GOVERNED BY, AND SHALL BE CONSTRUED AND ENFORCED IN ACCORDANCE WITH, THE LAWS OF THE STATE OF NEW YORK.

D. Counterparts. This Amendment may be executed in any number of counterparts (and by different parties hereto in separate counterparts), each of which when so executed and delivered shall constitute an original, but all such counterparts together shall constitute but one and the same instrument; signature pages may be detached from multiple separate counterparts and attached to a single counterpart so that all signature pages are physically attached to the same document. Delivery of an executed counterpart by facsimile or other electronic transmission (e.g., "pdf" or "tif") shall be effective as delivery of a manually executed counterpart of this Amendment.

E. Binding Effect. The execution and delivery of this Amendment by any Lender shall be binding upon each of its successors and assigns (including assignees of its Loans in whole or in part prior to the effectiveness hereof).

F. Waiver of Jury Trial. Each of the parties hereto irrevocably waives trial by jury in any action or proceeding with respect to this Amendment or any other Credit Document.

Figure 1: Excerpt from an Exhibit-10 material contract (https://www.sec.gov/Archives/edgar/data/0001171825/000119312519044328/d691151dex101.htm) showing different markup of potential labels (underlined and in bold) and the accompanying provisions.

...

3. Data

We first describe the data source of our corpus and then outline the extraction of the labeled provisions.

3.1. Data Source

The corpus for the study is comprised of contracts retrieved from the U.S. Securities and Exchange Commission (SEC) website. The SEC hosts a service called EDGAR (Electronic Data Gathering, Analysis, and Retrieval system), through which both domestic and foreign companies conducting business in the U.S.A. are required to submit regular reports. The focus of the crawling was on filings containing material contracts (known as Exhibit-10), such as various agreements, including shareholder, employment, and non-disclosure agreements. This focus was chosen because these material contracts contain various provision types and offer the opportunity for automated label acquisition.

3.2. Crawling and Scraping

We crawled all Exhibit-10 contracts from (including) 2016 to 2019, which yields an initial set of 117,578 contracts from which we heuristically scraped labeled provisions. While not all Exhibit-10 filings follow a standardized HTML markup format, we observed regularities in marking up the names of provisions, as shown in Figure 1. That is, provisions are often prepended with the name of their type, and this type is often displayed in special formatting, such as bold-face or underline, followed by a sentence delimiter, such as a dot or colon. We identified two frequent markup strategies for the provision types (i.e. using either <u> or <font> tags) and heuristically scanned the beginning of paragraphs (identified by <p> or <div> tags) in contracts for occurrences of the pattern:

underlined and/or bold text + delimiter + differently formatted text

We then treated the specially formatted text as the (potential) label, and the non-formatted text as the (potential) provision text.
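
The pattern described above can be illustrated with a minimal sketch, assuming simple, non-nested HTML. It scans <p> and <div> paragraphs, checks whether the paragraph starts with a <u> or <font> element, and requires a dot or colon either at the end of the formatted text or directly after it. The regular expressions, the function names, and the sample HTML are assumptions made for illustration; the authors' actual heuristics live in their codebase and may differ, and regex-based HTML handling is brittle on real filings.

```python
import re
from html import unescape

# Paragraph containers named in the text: <p> and <div> (no nesting handled).
PARAGRAPH = re.compile(r"<(p|div)\b[^>]*>(.*?)</\1>", re.I | re.S)
# A <u> or <font> element at the very start of the paragraph, then the rest.
FORMATTED_START = re.compile(r"\s*<(u|font)\b[^>]*>(.*?)</\1>(.*)", re.I | re.S)
TAG = re.compile(r"<[^>]+>")
DELIMITERS = ".:"


def plain(fragment):
    """Drop remaining tags, decode entities, and collapse whitespace."""
    return " ".join(unescape(TAG.sub(" ", fragment)).split())


def extract_candidates(html):
    """Yield (label, provision) pairs for paragraphs matching the pattern."""
    for para in PARAGRAPH.finditer(html):
        match = FORMATTED_START.match(para.group(2))
        if not match:
            continue
        label, rest = plain(match.group(2)), plain(match.group(3))
        # The delimiter may sit inside the formatted span or right after it.
        if label and label[-1] in DELIMITERS:
            label = label[:-1].rstrip()
        elif rest and rest[0] in DELIMITERS:
            rest = rest[1:].lstrip()
        else:
            continue
        if label and rest:
            yield label, rest


# Hypothetical markup, loosely modeled on the provisions in Figure 1.
SAMPLE_HTML = (
    "<p><u>Applicable Law</u>. This Amendment shall be governed by "
    "the laws of the State of New York.</p>"
    '<div><font style="font-weight:bold">Counterparts.</font> This Amendment '
    "may be executed in any number of counterparts.</div>"
    "<p>This paragraph has no formatted label.</p>"
)
```

Calling list(extract_candidates(SAMPLE_HTML)) yields two candidate pairs and skips the third paragraph, which has no specially formatted prefix.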

To cope with noisy extractions, we applied several filters during the scraping process (e.g. minimum and maximum length of each element in the pattern; first character in text elements must be uppercase; texts cannot consist of stopwords only; labels cannot end in stopwords, etc.).
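
A minimal sketch of the filters listed above might look as follows. The length thresholds and the stopword list are hypothetical placeholders, since the excerpt does not give the actual values, and the helper names are invented for illustration; the complete set of filters is in the codebase accompanying the paper.

```python
# Tiny illustrative stopword list (assumption, not the list used by the authors).
STOPWORDS = frozenset(
    {"a", "an", "and", "as", "by", "for", "in", "of", "or", "the", "to", "with"}
)
LABEL_LEN = (3, 100)        # hypothetical character bounds for labels
PROVISION_LEN = (20, 5000)  # hypothetical character bounds for provision text


def words(text):
    """Lowercased words with surrounding punctuation stripped; empties dropped."""
    stripped = (w.strip(".,;:()\"'").lower() for w in text.split())
    return [w for w in stripped if w]


def within(text, bounds):
    """Check the minimum and maximum length of one pattern element."""
    low, high = bounds
    return low <= len(text) <= high


def only_stopwords(text):
    """True if the text has words and every one of them is a stopword."""
    toks = words(text)
    return bool(toks) and all(w in STOPWORDS for w in toks)


def keep(label, provision):
    """Apply the filters named in the text to one (label, provision) candidate."""
    if not (within(label, LABEL_LEN) and within(provision, PROVISION_LEN)):
        return False
    # First character of each text element must be uppercase.
    if not (label[0].isupper() and provision[0].isupper()):
        return False
    # Texts cannot consist of stopwords only.
    if only_stopwords(label) or only_stopwords(provision):
        return False
    # Labels cannot end in a stopword.
    label_words = words(label)
    if not label_words or label_words[-1] in STOPWORDS:
        return False
    return True
```

For example, keep("Governing Law", "This Agreement shall be governed by the laws of New York.") passes all checks, whereas a label such as "Pursuant to the" is rejected because it ends in a stopword.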

The approach to navigating EDGAR and the complete set of filters are detailed in the codebase accompanying the paper, and a guide to EDGAR is available at the following URL: https://www.sec.gov/oiea/Article/edgarguide.html.

References


	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2020 LEDGARALargeScaleMultiLabelCorp	Don Tuggener Thomas Peetz Mark Cieliebak Pius Von Däniken			LEDGAR: A Large-scale Multi-label Corpus for Text Classification of Legal Provisions in Contracts						2020