DEFT Corpus
A DEFT Corpus is an NLP benchmark corpus for definition extraction.
- Context:
- It can be used by a DEFT Competition, such as DeftEval 2020 (SemEval 2020 - Task 6) [1].
- See: Definition Extraction.
References
2019
- https://github.com/adobe-research/deft_corpus
- QUOTE: Welcome to the largest expertly annotated corpus for complex definition extraction in free text. Pardon our dust - this data is associated with SemEval 2020 Task 6 (DeftEval) and we are releasing the full dataset on the SemEval conference schedule. Train and dev data are available, and test data will become available after the completion of the SemEval evaluation period on 2 Feb 2020. You can source the complete text from the corresponding textbooks at https://cnx.org. The most recent version of the corpus was updated on 30 OCT 2019.
2019b
- https://competitions.codalab.org/competitions/20900
- QUOTE: Definition extraction has been a popular topic in NLP research for well more than a decade, but has been historically limited to well defined, structured, and narrow conditions. In reality, natural language is complicated, and complicated data requires both complex solutions and data that reflects that reality. The DEFT corpus expands on these cases to include term-definition pairs that cross sentence boundaries, lack explicit definitors, or definition-like verb phrases (e.g. is, means, is defined as, etc.), or appear in non-hypernym structures.
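The contrast drawn in this quote can be illustrated with a minimal sketch of the narrow, pattern-based approach it describes. The regular expression, the function name `extract_definition`, and the example sentences below are all hypothetical illustrations, not part of the DEFT corpus or its tooling; the cue phrases are the ones named in the quote (is, means, is defined as).

```python
import re

# Hypothetical pattern for explicit definitions of the form
# "<Term> is defined as|means|is <definition>."
# Longer cue phrases come first so "is defined as" wins over plain "is".
DEFINITOR = re.compile(
    r"^(?P<term>[A-Z][\w\s-]*?)\s+"
    r"(?P<cue>is defined as|means|is)\s+"
    r"(?P<definition>.+?)\.?$"
)


def extract_definition(sentence):
    """Return (term, definition) for an explicit single-sentence definition, else None."""
    match = DEFINITOR.match(sentence.strip())
    if match is None:
        return None
    return match.group("term"), match.group("definition")


# An explicit definitor is caught by the pattern.
explicit = extract_definition(
    "Osmosis is defined as the movement of water across a membrane."
)

# A definition without an explicit definitor is missed entirely.
implicit = extract_definition(
    "Osmosis, the movement of water across a membrane, keeps cells balanced."
)
```

Here `explicit` holds the term and its definition, while `implicit` is `None` even though the sentence still defines the term. A pattern like this also over-matches on any sentence with a bare "is", and it cannot follow a definition across a sentence boundary, which are the kinds of cases the quote says the DEFT corpus is meant to cover.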
2019
- (Spala et al., 2019) ⇒ Sasha Spala, Nicholas A. Miller, Yiming Yang, Franck Dernoncourt, and Carl Dockhorn. (2019). “DEFT: A Corpus for Definition Extraction in Free- and Semi-structured Text.” In: Proceedings of the 13th Linguistic Annotation Workshop.
- QUOTE: ... The DEFT corpus [1] consists of annotated content from two different data sources: 1) 2,443 sentences (5,324,430 tokens) from various 2017 SEC contract filings from the publicly available US Securities and Exchange Commission EDGAR (SEC) database, and 2) 21,303 sentences (409,253 tokens) from the https://cnx.org/ open source textbooks (by various authors, licensed under CC BY 4.0) including topics in biology, history, physics, psychology, economics, sociology, and government. 22% of SEC sentences contain definitions and 28% of textbook sentences contain definitions. Our entire corpus, including both datasets, is significantly larger and more complex than any existing definition extraction dataset (see Table 1).
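As a rough worked example, the sentence counts and definition rates quoted from Spala et al. (2019) can be turned into approximate numbers of definition-bearing sentences. The variable and function names below are hypothetical, and because the quoted percentages are rounded, the results are estimates only.

```python
# Sentence counts and definition rates as quoted from Spala et al. (2019).
SOURCES = {
    "SEC filings": (2_443, 0.22),
    "textbooks": (21_303, 0.28),
}


def approx_definition_sentences(sentences, rate):
    """Estimate how many sentences contain a definition, to the nearest whole sentence."""
    return round(sentences * rate)


estimates = {
    name: approx_definition_sentences(count, rate)
    for name, (count, rate) in SOURCES.items()
}
total_sentences = sum(count for count, _ in SOURCES.values())

for name, estimate in estimates.items():
    print(f"{name}: about {estimate} sentences with definitions")
print(f"total sentences across both sources: {total_sentences}")
```

Under these figures, the estimate is about 537 definition-bearing sentences from the SEC filings and about 5,965 from the textbooks, out of 23,746 sentences in total.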