Legal Document Corpus
A Legal Document Corpus is a domain-specific document corpus composed of legal documents (such as contracts and legal judgments).
- AKA: Legal Corpora.
- Context:
- It can be used for research and development in the fields of Legal NLP, legal analytics, and AI-driven legal solutions.
- It can serve as a foundation for tasks like contract analysis, legal prediction, or document-level inference.
- It can range from being a Real-World Legal Corpus to being a Synthetic Legal Corpus.
- It can range from being a Large Legal Corpus to being a Small Legal Corpus, based on its corpus size.
- It can range from being an Unannotated Legal Corpus to being an Annotated Legal Corpus, depending on the presence of annotated legal documents.
- It can be developed through a Legal Corpus Creation Task.
- It can be managed through a Legal Corpus Management Task.
- ...
- Example(s):
- ContractNLI by Koreeda & Manning (2021) - A corpus designed for document-level natural language inference for contracts.
- A corpus for automatic structuring of legal judgment documents (Kalamkar et al., 2022).
- HLDC: a Hindi legal documents corpus (Kapoor et al., 2022).
- ILDC for CJPE: a corpus of Indian legal documents for court judgment prediction and explanation (Malik et al., 2021).
- a CourtListener-based Corpus, such as the FreeLaw subset of The Pile.
- ...
- Counter-Example(s):
- a Biomedical Corpus.
- a generic English text corpus.
- a collection of fiction novels.
- See: Legal NLP, Legal Document Management, Domain-Specific Corpus, Legal Analytics, Benchmark Dataset.
References
2023
- (GitHub, 2023) ⇒ https://github.com/neelguha/legal-ml-datasets
- This is a collection of pointers to datasets/tasks/benchmarks pertaining to the intersection of machine learning and law. This page is continually being updated. If I missed something, please contact me (Neel Guha) at nguha@stanford.edu and I'll add it!
- Task-agnostic datasets: These datasets can be used for pretraining larger models. Alternatively, you can use them to construct artificial tasks.
- [Caselaw Access Project](https://case.law/): all official, book-published United States case law.
- [Legifrance](https://www.legifrance.gouv.fr/): a French legal publisher providing access to law codes and legal decisions. Requires scraping ([Paper](http://ceur-ws.org/Vol-2645/paper2.pdf)).
- [US Supreme Court Database](http://scdb.wustl.edu/): information about every case decided by the US Supreme Court between 1791 and today.
- [European Parliament Proceedings](https://www.statmt.org/europarl/): parallel text of the proceedings of the European Parliament, collected in 11 languages.
- [US Code](https://uscode.house.gov/download/download.shtml): downloadable version of the US Code in XML format.
- [Patent Litigation Docket Reports](https://www.uspto.gov/learning-and-resources/electronic-data-products/patent-litigation-docket-reports-data): detailed patent litigation data on over 80k unique district court cases.
- [Pile of Law](https://huggingface.co/datasets/pile-of-law/pile-of-law): a 256GB dataset of legal, administrative, and contractual texts.
- [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus): the first and only multijurisdictional open corpus of Australian legislative and judicial documents.
- Benchmarks which combine multiple types of tasks
- [LexGLUE](https://huggingface.co/datasets/lex_glue): a GLUE-inspired set of legal tasks (see the loading sketch below).
- [LegalBench](https://github.com/HazyResearch/legalbench): a large language model benchmark for legal reasoning.
- Judgement prediction: Training a model to predict the outcome of a case from various case-specific features.
- [European Court of Human Rights](https://archive.org/details/ECHR-ACL2019): 11.5k cases from the ECHR's public database. [Paper](https://www.aclweb.org/anthology/P19-1424/).
- Document/contract annotation: Training a model to annotate sentences/clauses/sections in a contract (or other document) according to various criteria (e.g., unfairness, argument structure, etc.).
- [Detecting unfair clauses from online terms-of-service](http://155.185.228.137/claudette/ToS.zip): ~12k sentences from 50 terms-of-service agreements. [Paper](https://arxiv.org/pdf/1805.01217.pdf).
- [Usable Privacy Project Data](https://usableprivacy.org/data): a collection of datasets for privacy policies, including OPP-115, APP-350, MAPS, and the ACL/COLING 2014 Dataset.
- [Contract extraction dataset](http://nlp.cs.aueb.gr/software_and_datasets/CONTRACTS_ICAIL2017/index.html): 3,500 English contracts manually annotated with 11 different contract elements. [Paper](http://nlp.cs.aueb.gr/pubs/icail2017.pdf).
- [EURLEX with EUROVOC annotations](http://nlp.cs.aueb.gr/software_and_datasets/EURLEX57K/index.html): 57k legislative documents from the EU's public document database, annotated with concepts from EUROVOC. [Paper](https://www.aclweb.org/anthology/W19-2209/).
- [Cornell eRulemaking Corpus](https://facultystaff.richmond.edu/~jpark/data/jpark_lrec18.zip): collection of 731 user comments on the Consumer Debt Collection Practices rule by the CFPB, with annotations containing information about argument structure. [Paper](https://facultystaff.richmond.edu/~jpark/papers/jpark_lrec18.pdf).
- [German rental agreements (in English)](https://github.com/sebischair/Legal-Sentence-Classification-Datasets-and-Models): ~913 sentences from German rental agreements annotated by semantic type. [Paper](https://www.researchgate.net/publication/332171940_Classifying_Semantic_Types_of_Legal_Sentences_Portability_of_Machine_Learning_Models).
- [Segmenting US court decision opinions into issue parts](https://github.com/jsavelka/us-dec-func-iss-sgm/blob/master/trade_secret_cases.json): 316 court decisions on cyber crime and trade secrets, manually segmented into 6 content-based "types" (encompassing categories like "Introduction", "Dissent", or "Background"). [Paper](http://ebooks.iospress.nl/volumearticle/50840).
- [ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts](https://arxiv.org/abs/2110.01799)
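The Hugging Face-hosted corpora above can be loaded with the `datasets` library. Below is a minimal sketch, assuming the subset name `r_legaladvice` (for Pile of Law) and the task name `ecthr_a` (for LexGLUE) as illustrative configuration choices taken from the dataset cards; dataset locations and loader requirements change over time, so check the linked pages first.

```python
# Minimal sketch: loading two of the corpora listed above with the
# Hugging Face `datasets` library. Config names ("r_legaladvice",
# "ecthr_a") are illustrative picks from the dataset cards.
from datasets import load_dataset

# Stream one named subset of Pile of Law instead of downloading all 256GB.
# Depending on your `datasets` version, script-based datasets may also
# require trust_remote_code=True.
pile_of_law = load_dataset(
    "pile-of-law/pile-of-law",
    "r_legaladvice",      # assumed subset name
    split="train",
    streaming=True,
)
first_doc = next(iter(pile_of_law))
print(first_doc["text"][:200])   # each record carries a raw "text" field

# LexGLUE bundles several legal tasks, each selected by config name.
ecthr_a = load_dataset("lex_glue", "ecthr_a", split="test")
print(ecthr_a[0]["labels"])      # ECtHR article labels for the first case
```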
2022
- (Kalamkar et al., 2022) ⇒ P. Kalamkar, A. Tiwari, A. Agarwal, S. Karn... (2022). “[Corpus for automatic structuring of legal documents](https://arxiv.org)”. In: arXiv preprint.
- QUOTE: ... for processing and organizing legal documents. In this paper, we introduce a ... corpus for structuring legal documents. In particular, we introduce a corpus of legal judgment documents in ...
2022
- (Kapoor et al., 2022) ⇒ A. Kapoor, M. Dhawan, A. Goel, T.H. Arjun... (2022). “[HLDC: Hindi legal documents corpus](https://arxiv.org)”. In: arXiv preprint.
- QUOTE: ... growing corpus as we continue to add more legal documents to HLDC. We release the corpus ... As a use-case for applicability of the corpus for developing legal systems, we propose the ...
2021
- (Malik et al., 2021) ⇒ V. Malik, R. Sanjay, S.K. Nigam, K. Ghosh... (2021). “[ILDC for CJPE: Indian legal documents corpus for court judgment prediction and explanation](https://arxiv.org)”. In: arXiv preprint.
- QUOTE: ... In this paper, we introduce INDIAN LEGAL DOCUMENTS CORPUS (ILDC) intending to ... could assist in legal case judgment prediction in an explainable way. ILDC is a corpus of case ...
2021
- (Koreeda & Manning, 2021) ⇒ Yuta Koreeda, and Christopher D. Manning. (2021). “ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts.” In: Findings of the Association for Computational Linguistics: EMNLP 2021.
- ABSTRACT: Reviewing contracts is a time-consuming procedure that incurs large expenses to companies and social inequality to those who cannot afford it. In this work, we propose "document-level natural language inference (NLI) for contracts", a novel, real-world application of NLI that addresses such problems. In this task, a system is given a set of hypotheses (such as "Some obligations of Agreement may survive termination.") and a contract, and it is asked to classify whether each hypothesis is "entailed by", "contradicting to" or "not mentioned by" (neutral to) the contract as well as identifying "evidence" for the decision as spans in the contract. We annotated and release the largest corpus to date consisting of 607 annotated contracts. We then show that existing models fail badly on our task and introduce a strong baseline, which (1) models evidence identification as multi-label classification over spans instead of trying to predict start and end tokens, and (2) employs more sophisticated context segmentation for dealing with long documents. We also show that linguistic characteristics of contracts, such as negations by exceptions, are contributing to the difficulty of this task and that there is much room for improvement.
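The task format described in this abstract can be made concrete with a small sketch (illustrative only, not the authors' code or data format): each contract-hypothesis pair carries one of three NLI labels, and evidence is a multi-label selection over pre-segmented contract spans rather than predicted start/end tokens.

```python
# Illustrative encoding of the ContractNLI task structure (a sketch,
# not the authors' code or data format).
from dataclasses import dataclass
from enum import Enum

class NLILabel(Enum):
    ENTAILMENT = "entailed by"
    CONTRADICTION = "contradicting to"
    NOT_MENTIONED = "not mentioned by"   # i.e., neutral

@dataclass
class ContractNLIExample:
    contract_spans: list[str]   # the contract, pre-segmented into spans
    hypothesis: str             # a fixed hypothesis evaluated against the contract
    label: NLILabel             # three-way NLI decision
    evidence: set[int]          # indices of supporting spans (multi-label)

example = ContractNLIExample(
    contract_spans=[
        "1. Confidential Information shall be kept secret for five (5) years.",
        "2. Sections 1 and 4 shall survive termination of this Agreement.",
    ],
    hypothesis="Some obligations of Agreement may survive termination.",
    label=NLILabel.ENTAILMENT,
    evidence={1},  # the survival clause is the supporting span
)
```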
- QUOTE: Helped by their accessibility, there exist multiple prior works on legal NLI. One of the subtasks in COLIEE-2020 shared task (Rabelo et al., 2020) was, given a court decision Q and relevant cases, to extract relevant paragraphs from the cases and to classify whether those paragraphs entail “Q” or “not Q”. Holzenberger et al. (2020) introduced a dataset for predicting an entailment relationship between a statement and a statute excerpt. While they are both “legal” and “NLI”, statutes and contracts exhibit different characteristics including the fact that statutes/cases tend to be written in consistent vocabulary and styles. Moreover, there only exists a single right answer for a hypothesis in case/statute law NLI, whereas a hypothesis can be entailed by or contradicting to each contract in our task; i.e., hypotheses and documents have one-to-one relationships in case/statute law NLI, but they have many-to-many relationships in our task.
As discussed in Section 1, our task has practical and scientific significance compared to information extraction for contracts (Leivaditi et al., 2020; Hendrycks et al., 2021). We showed in our experiments that the NLI part of our task is much more challenging than the evidence identification task. Furthermore, we gave observations to linguistic characteristics of our dataset that are lacking in these prior works.
Lippi et al. (2019) presented a dataset where certain types of contract clauses are identified and annotated with “clearly fair”, “potentially unfair” or “clearly unfair”. While the format of the task input and output is quite similar, our task requires reasoning over a much diverse set of hypotheses than just fair or unfair. Similarly, fact extraction and claim verification tasks (Thorne et al., 2018; Jiang et al., 2020), where the task is to extract facts from Wikipedia articles and to classify whether the claim is entailed by the facts, have similar input and output formats. ...
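The many-to-many point in the quote above can be illustrated with a toy example (hypothetical contracts and labels): in ContractNLI the same fixed hypothesis is re-evaluated against every contract and may receive a different label each time, whereas in case/statute-law NLI a hypothesis has a single correct answer.

```python
# Toy illustration (hypothetical data) of the many-to-many relationship
# between hypotheses and contracts in ContractNLI.
hypothesis = "Receiving Party may share some Confidential Information with third parties."

labels_by_contract = {
    "nda_alpha.txt": "entailed by",       # this contract permits limited disclosure
    "nda_beta.txt": "contradicting to",   # this contract forbids any third-party sharing
    "nda_gamma.txt": "not mentioned by",  # this contract is silent on the matter
}

for contract, label in labels_by_contract.items():
    print(f"{hypothesis!r} is {label} {contract}")
```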