Penn Treebank Project


The Penn Treebank Project is a research project that annotates a large text corpus with syntactic relations.



References

2009

  • http://www.cis.upenn.edu/~treebank/
    • The Penn Treebank Project annotates naturally-occurring text for linguistic structure. Most notably, we produce skeletal parses showing rough syntactic and semantic information -- a bank of linguistic trees. We also annotate text with [tags], and for the Switchboard corpus of telephone conversations, [annotation].

2009

  • http://www.cis.upenn.edu/~treebank/tokenization.html
    • Our tokenization is fairly simple:
      • most punctuation is split from adjoining words
      • double quotes (") are changed to doubled single forward- and backward- quotes (`` and '')
      • verb contractions and the Anglo-Saxon genitive of nouns are split into their component morphemes, and each morpheme is tagged separately.
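
The three rules quoted above can be illustrated with a minimal Python sketch. This is not the project's own tokenizer: the function name `ptb_tokenize_sketch`, the particular punctuation set, and the list of clitics are assumptions chosen for illustration, and the sketch ignores many cases (abbreviations, numbers, hyphenation) that a real tokenizer must handle.

```python
import re

def ptb_tokenize_sketch(text: str) -> list[str]:
    """Hypothetical, simplified illustration of the three quoted rules."""
    # Rule 2: a double quote at the start of the text or after whitespace or
    # an opening bracket becomes ``; every remaining double quote becomes ''.
    text = re.sub(r'(?:^|(?<=[\s(\[{]))"', '``', text)
    text = text.replace('"', "''")
    text = text.replace('``', ' `` ').replace("''", " '' ")

    # Rule 1: split punctuation from adjoining words. Only a sentence-final
    # period is split, so periods inside the text are left alone (assumption).
    text = re.sub(r"\.(?=(?:\s|'')*$)", ' . ', text)
    text = re.sub(r'([,;:?!()\[\]{}])', r' \1 ', text)

    # Rule 3: split verb contractions and the genitive 's into morphemes.
    text = re.sub(r"\b([a-z]+)(n't)\b", r'\1 \2', text, flags=re.IGNORECASE)
    text = re.sub(r"([a-z])('(?:s|m|d|re|ve|ll))\b", r'\1 \2', text,
                  flags=re.IGNORECASE)

    return text.split()

if __name__ == '__main__':
    print(ptb_tokenize_sketch('"We don\'t know," said the children\'s teacher.'))
```

Under these assumptions, the example sentence is split into the tokens ``, We, do, n't, know, a comma, '', said, the, children, 's, teacher, and a final period. Each resulting token would then receive its own tag, as the third rule describes.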

1994

  • (Marcus et al., 1994) ⇒ Mitchell P. Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. (1994). “The Penn Treebank: A revised corpus design for extracting predicate argument structure.” In: Human Language Technology, ARPA March 1994 Workshop.
