Penn Treebank Project
The Penn Treebank Project is a research project that annotates a large corpus with syntactic relations.
- Context:
- It produced the Penn Treebank Corpus.
- See: Annotation Task, Part-of-Speech Annotation Task, Natural Language Parsing.
References
2017
- https://spacy.io/usage/facts-figures#section-benchmarks
- QUOTE: Parse accuracy (Penn Treebank / Wall Street Journal)
This is the "classic" evaluation, so it's the number parsing researchers are most easily able to put in context. However, it's quite far removed from actual usage: it uses sentences with gold-standard segmentation and tokenization, from a pretty specific type of text (articles from a single newspaper, 1984-1989).
2009
- http://www.cis.upenn.edu/~treebank/
- “The Penn Treebank Project annotates naturally-occurring text for linguistic structure. Most notably, we produce skeletal parses showing rough syntactic and semantic information -- a bank of linguistic trees. We also annotate text with [tags], and for the Switchboard corpus of telephone conversations, [annotation].”
2009
- http://www.cis.upenn.edu/~treebank/tokenization.html
- Our tokenization is fairly simple:
- most punctuation is split from adjoining words
- double quotes (") are changed to doubled single forward- and backward- quotes (`` and '')
- verb contractions and the Anglo-Saxon genitive of nouns are split into their component morphemes, and each morpheme is tagged separately.
- …
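The three rules quoted above can be illustrated with a minimal Python sketch. It is a rough approximation under stated assumptions, not the project's own tokenizer: the function name `ptb_like_tokenize`, the choice of which punctuation marks to split, and the list of contraction suffixes are all hypothetical and chosen only for illustration. Only the rules listed above are covered; whatever the elided items specify is not.

```python
import re

def ptb_like_tokenize(text):
    """Approximate the three quoted rules; not the official Penn Treebank tokenizer."""
    # Double quotes: a quote at the start of the text, or after whitespace or an
    # opening bracket, is treated as opening (``); any other quote as closing ('').
    text = re.sub(r'(^|[\s(\[{])"', r'\1`` ', text)
    text = text.replace('"', " '' ")

    # Punctuation: split a (hypothetical) set of marks from adjoining words.
    text = re.sub(r'([,;:?!()\[\]{}])', r' \1 ', text)
    # Split only a text-final period, so periods inside abbreviations stay attached.
    text = re.sub(r"\.((?:\s|''|[)\]}])*)$", r' .\1', text)

    # Contractions and genitive: "can't" -> "ca n't", "John's" -> "John 's".
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    text = re.sub(r"(\w)('(?:s|m|d|re|ve|ll))\b", r"\1 \2", text, flags=re.IGNORECASE)

    return text.split()


if __name__ == "__main__":
    print(ptb_like_tokenize('She said, "I can\'t find John\'s book."'))
```

The sketch expects one sentence per call, since only a period at the very end of the input is split off. Each resulting token (for example `ca`, `n't`, `John`, `'s`) would then be tagged separately, as the quoted rule on contractions and genitives describes.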
1994
- (Marcus et al., 1994) ⇒ Mitchell P. Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. (1994). “The Penn Treebank: A revised corpus design for extracting predicate argument structure.” In: Human Language Technology, ARPA March 1994 Workshop.
1993
- (Marcus et al., 1993) ⇒ Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. (1993). “Building a large annotated corpus of English: The Penn Treebank.” In: Computational Linguistics, 19(2).