SemCor Corpus

Context:
- It is available online at: http://web.eecs.umich.edu/~mihalcea/downloads.html#semcor
- It can (typically) be a subset of the English Brown Corpus containing 360,000 words.
- It can (typically) be composed of 352 texts.
- It can (typically) have a SemCor Sense Inventory (likely based on WordNet 1.6 and automatically mapped to subsequent versions of WordNet (1.7, ..., 3.0)).
- It can include Part of Speech Tagging for all Words.
- It can be one of the largest publicly available Sense-Tagged Corpora.
- It can have more than 200,000 content words that are also sense-tagged according to Princeton WordNet 2.1.
- It can have 186 texts in which all of the Open Class Words (192,639 nouns, verbs, adjectives, and adverbs) are annotated with POS, lemma, and WordNet synset, and 166 remaining texts in which only verbs (41,497 occurrences) are annotated with lemma and synset.
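
The token-level annotation described above (POS, lemma, and WordNet sense per word) can be illustrated with a minimal sketch. It assumes the SGML-style `<wf ...>` token markup with unquoted attributes (`pos`, `lemma`, `wnsn`, `lexsn`) used in the distributed SemCor files. The sample line is made up for illustration and is not copied from the corpus, and the names `Token` and `parse_tokens` are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    surface: str                 # word form as it appears in the text
    pos: Optional[str]           # part-of-speech tag, if present
    lemma: Optional[str]         # lemma, present only on sense-tagged tokens
    sense_number: Optional[str]  # WordNet sense number (wnsn attribute)
    sense_key: Optional[str]     # lemma%lexsn, built from the two attributes


# One <wf ...>surface</wf> element; attribute values are assumed unquoted.
WF_PATTERN = re.compile(r"<wf\s+([^>]*)>([^<]*)</wf>")
ATTR_PATTERN = re.compile(r"(\w+)=(\S+)")


def parse_tokens(line: str) -> List[Token]:
    """Extract tokens and their annotations from one line of markup."""
    tokens = []
    for attr_text, surface in WF_PATTERN.findall(line):
        attrs = dict(ATTR_PATTERN.findall(attr_text))
        lemma = attrs.get("lemma")
        lexsn = attrs.get("lexsn")
        # A sense key is only formed when both parts are available.
        sense_key = f"{lemma}%{lexsn}" if lemma and lexsn else None
        tokens.append(
            Token(surface, attrs.get("pos"), lemma, attrs.get("wnsn"), sense_key)
        )
    return tokens


# Illustrative sample (an assumption, not a verbatim corpus line).
SAMPLE = (
    "<wf cmd=ignore pos=DT>The</wf>"
    "<wf cmd=done pos=NN lemma=jury wnsn=1 lexsn=1:14:00::>jury</wf>"
    "<wf cmd=done pos=VB lemma=say wnsn=1 lexsn=2:32:00::>said</wf>"
)

tokens = parse_tokens(SAMPLE)
sense_tagged = [t for t in tokens if t.sense_key is not None]
```

In this sketch, tokens without a `lemma` and `lexsn` attribute (such as the determiner in the sample) are kept but carry no sense key, which mirrors the distinction between sense-tagged content words and untagged words.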
Example(s):
- SemCor 3.0 (2008-06-13)[1], described as “automatically created from SemCor 1.6 by mapping WordNet 1.6 to WordNet 3.0 senses”.
- SemCor 2.1 (2006-04-06)[2].
- SemCor 2.0 (2003-10-25)[3].
- SemCor 1.7 (2001-05-25)[4].
- SemCor 1.6 (1997-00-00)[5].
- …
Counter-Example(s):
See: SensEval-2 Benchmark Task.

References

(Bond et al., 2012) ⇒ Francis Bond, Timothy Baldwin, Richard Fothergill, and Kiyotaka Uchimoto. (2012). “Japanese SemCor: A Sense-tagged Corpus of Japanese.” In: Proceedings of the 6th Global WordNet Conference (GWC

http://web.eecs.umich.edu/~mihalcea/downloads.html#semcor (previously http://www.cse.unt.edu/~rada/downloads.html#semcor )
- QUOTE: The POS tags were assigned by the Brill tagger, and the semantic tagging was done manually, using WordNet 1.6 senses.
  Semcor is composed of 352 texts. In 186 texts all of the open class words (192,639 nouns, verbs, adjectives, and adverbs) are annotated with POS, lemma, and WordNet synset, while in the remaining 166 texts only verbs (41,497 occurrences) are annotated with lemma and synset.

(Mihalcea, 1998) ⇒ Rada Mihalcea. (1998). “Semcor semantically tagged corpus.” Unpublished manuscript.

(Miller et al., 1994) ⇒ George A. Miller, Martin Chodorow, Shari Landes, Claudia Leacock, and Robert G. Thomas. (1994). “Using a Semantic Concordance for Sense Identification.” In: Proceedings of ARPA Human Language Technology Workshop.

(Miller et al., 1993) ⇒ George A. Miller, Claudia Leacock, Randee Tengi, and Ross T. Bunker. (1993). “A Semantic Concordance.” In: Proceedings of the 3rd DARPA Workshop on Human Language Technology.
- QUOTE: A semantic concordance is a textual corpus and a lexicon so combined that every substantive word in the text is linked to its appropriate sense in the lexicon. Thus it can be viewed either as a corpus in which words have been tagged syntactically and semantically, or as a lexicon in which example sentences can be found for many definitions. A semantic concordance is being constructed to use in studies of sense resolution in context (semantic disambiguation). The Brown Corpus is the text and WordNet is the lexicon. Semantic tags (pointers to WordNet synsets) are inserted in the text manually using an interface, ConText, that was designed to facilitate the task. Another interface supports searches of the tagged text. Some practical uses for semantic concordances are proposed.