OntoNotes Corpus
An OntoNotes Corpus is a large, manually annotated corpus that comprises several text genres annotated with syntactic structure and shallow semantics.
- Context:
- It is developed by a Collaborative Project that includes: BBN Technologies, Information Sciences Institute of University of Southern California, University of Colorado, University of Pennsylvania and Brandeis University.
- It can be supported by the GALE Program.
- It can include Text from various genres, including: news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, and talk shows.
- It can include Text in three languages: English, Chinese, and Arabic.
- It can provide a skeletal representation of the literal meaning of sentences.
- It can include Structural Linguistic Annotation: syntax and predicate-argument structure.
- It can include Shallow Semantic Annotation: coreference mentions, and word senses linked to an ontology (the Omega Ontology).
- Example(s):
- Counter-Example(s):
- See: Ontonotes-5-Parsing Algorithm, Corpus, Treebank, PropBank, Ontology, Word Sense Annotation, Question Answering Task, Machine Translation.
References
2017
- (Weischedel et al., 2017) ⇒ Ralph Weischedel, Eduard Hovy, Mitchell Marcus, Martha Palmer, Robert Belvin, Sameer Pradhan, Lance Ramshaw, and Nianwen Xue (2017). "OntoNotes: A Large Training Corpus for Enhanced Processing."
- QUOTE: Our goal is to provide data in multiple languages and multiple genres (newswire, broadcast news, broadcast conversation, and web text), richly annotated by a skeletal representation of the literal meaning of sentences, so that a new generation of language understanding would deliver new functional capability. (...).
As shown in Figure 1, to the baseline structure of parse trees and propositions, OntoNotes adds:
- Referring expressions and the textual phrases they refer to;
- Terms disambiguated by word sense and localized in an ontology;
- Named Entities.
2012
- (Weischedel et al., 2012) ⇒ Ralph Weischedel, Sameer Pradhan, Lance Ramshaw, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Nianwen Xue, Martha Palmer, Jena D. Hwang, Claire Bonial, Jinho Choi, Aous Mansouri, Maha Foster, Abdel-aati Hawwary, Mitchell Marcus, Ann Taylor, Craig Greenberg, Eduard Hovy, Robert Belvin, and Ann Houston (2012). "OntoNotes Release 5.0 with OntoNotes DB Tool v0.999 beta".
- QUOTE: The OntoNotes project was a collaborative effort between BBN Technologies, Brandeis University, the University of Colorado, the University of Pennsylvania, and the University of Southern California's Information Sciences Institute. The goal was to annotate a large corpus comprising various genres (news, broadcast, talk shows, weblogs, usenet newsgroups, and conversational telephone speech) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference). OntoNotes builds on two time-tested resources, following the Penn Treebank for syntax and the Penn PropBank for predicate-argument structure. Its semantic representation adds coreference to PropBank, and includes partial word sense disambiguation for some nouns and verbs, with the word senses connected to an ontology. OntoNotes includes roughly 1.5 million words of English, 800K of Chinese, and 300K of Arabic. More details are provided in Weischedel et al. (2011).
2011
- (Weischedel et al., 2011) ⇒ Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, Ann Houston. (2011). “OntoNotes Release 4.0.” In: LDC Catalog. ISBN:1-58563-574-X
- QUOTE: … This cumulative publication consists of 2.4 million words as follows: 300k words of Arabic newswire; 250k words of Chinese newswire, 250k words of Chinese broadcast news, 150k words of Chinese broadcast conversation and 150k words of Chinese web text; and 600k words of English newswire, 200k words of English broadcast news, 200k words of English broadcast conversation and 300k words of English web text.
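The breakdown quoted above can be checked with simple arithmetic. The following is a minimal sketch that sums the quoted figures (in thousands of words); the dictionary layout and variable names are illustrative assumptions and are not part of the release itself.

```python
# Word counts in thousands, copied from the Release 4.0 description quoted above.
# The nested-dict layout and the names used here are illustrative assumptions.
release_4_counts_k = {
    "Arabic": {"newswire": 300},
    "Chinese": {
        "newswire": 250,
        "broadcast news": 250,
        "broadcast conversation": 150,
        "web text": 150,
    },
    "English": {
        "newswire": 600,
        "broadcast news": 200,
        "broadcast conversation": 200,
        "web text": 300,
    },
}

# Subtotal per language, then the grand total across all languages.
per_language_k = {
    language: sum(genres.values())
    for language, genres in release_4_counts_k.items()
}
total_k = sum(per_language_k.values())

print(per_language_k)  # {'Arabic': 300, 'Chinese': 800, 'English': 1300}
print(total_k)         # 2400, i.e. the 2.4 million words stated in the quote
```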
2010
- (Pradhan & Xue, 2010) ⇒ Sameer S. Pradhan and Nianwen Xue. (2010). “OntoNotes: The 90% Solution.” Tutorial/Demo at HLT-NAACL 2010.
- QUOTE: OntoNotes is a five year multi-site collaboration between BBN Technologies, Information Sciences Institute of University of Southern California, University of Colorado, University of Pennsylvania and Brandeis University. The goal of the OntoNotes project is to provide linguistic data annotated with a skeletal representation of the literal meaning of sentences including syntactic parse, predicate-argument structure, coreference, and word senses linked to an ontology, allowing a new generation of language understanding technologies to be developed with new functional capabilities.
2009a
- (Raytheon BBN Technologies, 2009) ⇒ http://www.bbn.com/ontonotes/
- QUOTE: The OntoNotes project is a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania, and the University of Southern California's Information Sciences Institute to produce such a resource. It aims to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference). OntoNotes builds on two time-tested resources, following the Penn Treebank for syntax and the Penn PropBank for predicate-argument structure. Its semantic representation will include word sense disambiguation for nouns and verbs, with each word sense connected to an ontology, and coreference. Over the course of the five-year program, our current goals call for annotation of over a million words each of English and Chinese, and half a million words of Arabic.
2009b
- (Finkel & Manning, 2009) ⇒ Jenny Rose Finkel, and Christopher D. Manning. (2009). “Joint Parsing and Named Entity Recognition.” In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2009).
2007
- (Pradhan et al., 2007) ⇒ Sameer S. Pradhan, Eduard Hovy, Mitch Marcus, Martha Palmer, Lance Ramshaw, Ralph Weischedel. (2007). “OntoNotes: A Unified Relational Semantic Representation.” In: Proceedings of the International Conference on Semantic Computing (ICSC 2007). doi:10.1109/ICSC.2007.83
- OntoNotes Corpus, Annotation Format.
- ABSTRACT: The OntoNotes project is creating a corpus of large-scale, accurate, and integrated annotation of multiple levels of the shallow semantic structure in text. Such rich, integrated annotation covering many levels will allow for richer, cross-level models enabling significantly better automatic semantic analysis. At the same time, it demands a robust, efficient, scalable mechanism for storing and accessing these complex inter-dependent annotations. We describe a relational database representation that captures both the inter- and intra-layer dependencies and provide details of an object-oriented API for efficient, multi-tiered access to this data.
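The idea of storing several annotation layers relationally, so that they can be queried together, can be illustrated with a toy example. This is only a minimal sketch using Python's standard `sqlite3` module: the table names, columns, sample sentence, and sense label are all hypothetical, and it does not reproduce the schema or the object-oriented API described in the paper.

```python
import sqlite3

# Hypothetical toy schema: one table per annotation layer, all keyed on token ids.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE token (id INTEGER PRIMARY KEY, sentence_id INTEGER, position INTEGER, word TEXT);
CREATE TABLE word_sense (token_id INTEGER REFERENCES token(id), sense TEXT);
CREATE TABLE coref_mention (
    chain_id INTEGER,
    start_token INTEGER REFERENCES token(id),
    end_token INTEGER REFERENCES token(id)
);
""")

# Invented sample sentence: "The company said it left"
words = ["The", "company", "said", "it", "left"]
conn.executemany(
    "INSERT INTO token (id, sentence_id, position, word) VALUES (?, 1, ?, ?)",
    [(i, i, w) for i, w in enumerate(words, start=1)],
)
# The sense label is made up for illustration only.
conn.execute("INSERT INTO word_sense VALUES (5, 'leave.sense-1')")
# One coreference chain linking "The company" (tokens 1-2) and "it" (token 4).
conn.executemany("INSERT INTO coref_mention VALUES (?, ?, ?)", [(1, 1, 2), (1, 4, 4)])


def mention_text(start, end):
    """Join the words of the tokens covered by a mention span."""
    rows = conn.execute(
        "SELECT word FROM token WHERE id BETWEEN ? AND ? ORDER BY position",
        (start, end),
    )
    return " ".join(word for (word,) in rows)


# Coreference layer resolved against the token layer.
chains = {}
for chain_id, start, end in conn.execute(
    "SELECT chain_id, start_token, end_token FROM coref_mention ORDER BY start_token"
).fetchall():
    chains.setdefault(chain_id, []).append(mention_text(start, end))

# Word-sense layer joined with the token layer.
senses = conn.execute(
    "SELECT t.word, s.sense FROM word_sense AS s JOIN token AS t ON t.id = s.token_id"
).fetchall()

print(chains)
print(senses)
```

Keying every layer on shared token ids is one simple way to let a single query span layers; the actual representation in the paper may differ.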
2006
- (Hovy et al., 2006) ⇒ Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. (2006). “OntoNotes: the 90% solution.” In: Proceedings of the Human Language Technology Conference of the NAACL (HLT-NAACL 2006).