Annotated Text Corpus
An annotated text corpus is a text corpus with annotated documents.
- Context:
- It can range from being a Syntactically Annotated Corpus to being a Semantically Annotated Corpus.
- It can be produced by an Annotation Task (and range from being a pre-annotated text corpus to being a custom-annotated text corpus).
- It can be a Labeled Record Set.
- It can range from being a Human Annotated Corpus to being a Machine Annotated Corpus.
- It can range from being a Weakly-Annotated Corpus to being a Richly-Annotated Corpus.
- …
- Example(s):
- Counter-Example(s):
- an Annotated Text Item.
- an Unannotated Corpus, such as a Google n-gram corpus.
- an Annotated Image Dataset.
- See: Subject Heading, Document Topic Taxonomy, Labeled Corpus.
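The contrast drawn above between an annotated corpus and an unannotated one can be made concrete with a small data-structure sketch. This is only an illustration, not a standard format: the class names, the `layer` and `source` fields, and the rule that every document must carry at least one annotation are all assumptions introduced here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """A labeled character span over a document's text (hypothetical schema)."""
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    layer: str   # e.g. "pos" (syntactic) or "entity" (semantic)
    label: str
    source: str  # "human" or "machine"


@dataclass
class AnnotatedDocument:
    text: str
    annotations: list[Annotation] = field(default_factory=list)

    def spans(self, layer: str) -> list[tuple[str, str]]:
        """Return (covered text, label) pairs for one annotation layer."""
        return [(self.text[a.start:a.end], a.label)
                for a in self.annotations if a.layer == layer]


def is_annotated_corpus(corpus: list[AnnotatedDocument]) -> bool:
    """Assumed rule: non-empty, and every document has at least one annotation."""
    return bool(corpus) and all(doc.annotations for doc in corpus)


doc = AnnotatedDocument(
    text="Reuters corpora are distributed by NIST.",
    annotations=[
        Annotation(0, 7, "entity", "ORG", "human"),     # semantic layer
        Annotation(35, 39, "entity", "ORG", "human"),   # semantic layer
        Annotation(8, 15, "pos", "NOUN", "machine"),    # syntactic layer
    ],
)

print(doc.spans("entity"))
print(is_annotated_corpus([doc]))
```

Keeping the annotations separate from the text (stand-off style) lets one document carry several layers at once, so the same corpus can sit anywhere on the syntactic-to-semantic and human-to-machine ranges listed in the context.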
References
2008
- http://www-nlp.stanford.edu/links/statnlp.html
- LDC (Linguistic Data Consortium) and its catalogue by year.
Provides the largest range of corpora on CD-ROM. Cost ranges from cheap (e.g., ACL-DCI disk) to pricey. CDs can be purchased individually; institutions can become members and receive discounts on CDs. There's an LDC Online service for searches over the web (mainly intended for members, but there are samplers available).
- European Language Resources Association and its catalogue.
Distribution agency is ELDA. Rapidly growing collection of materials in European languages.
- ICAME (International Computer Archive of Modern English)
Sells various corpora (including Brown and London-Lund). Information on the corpora is available on the web, by sending the message help to fileserv@nora.hd.uib.no, or by ftp to nora.hd.uib.no. Manuals for these corpora are also available.
- Reuters @ NIST
Reuters corpora are now distributed by NIST.
- TRACTOR
TELRI Research Archive of Computational Tools and Resources. Corpora, many multilingual, in European community languages. Small fee for joining in order to be able to get corpora (unless you have contributed corpora).
- CLR (Consortium for Lexical Research)
Focuses more on language processing tools and lexicons, but does have some corpora. As of Feb 1996, you can get most of their stuff by anonymous ftp to clr.nmsu.edu. Their catalog is available as a PostScript file.
- OTA (Oxford Text Archive)
Provides mainly literary texts. Has a bright new web site. Email: info@ota.ahds.ac.uk. Most materials are available on the web or by anonymous ftp to ota.ox.ac.uk. Some require negotiations with the providers.
- Leipzig Corpora Collection
Sentence collections in a MySQL database for 17 mainly European languages.
- BNC (British National Corpus)
A 100 million word corpus of British English. You can search it online from their simple web interface or via View, a much better interface by Mark Davies, and there is an index to genres by David Lee. There is now also an XML edition.
- European Corpus Initiative Multilingual Corpus I (ECI/MCI)
A 98 million word corpus, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, and Malay. Cheap. Need to sign a license agreement available at the WWW site. Also available from the LDC.
- Survey of English Usage
At the Department of English Language and Literature at University College London. Includes the British part of ICE, the International Corpus of English project. Now available tagged, and parsed for function. 83,419 sentences. Includes ICECUP, dedicated retrieval software. Also, Diachronic Corpus of Present-Day Spoken English (800,000 words, tagged and parsed, half from ICE-GB and half from London-Lund).
- International Corpus of English (ICE)
Million-word collections of English from various world Englishes: ICE-NZ, ICE-HK, ICE-East Africa, etc. Several of them are downloadable from this site.
- Corpora held by Lancaster University
This link provides its own annotations.
- The European Language Activity Network
Promises a uniform query language for accessing corpora in all EU languages, but isn't quite there yet.
- Talkbank.
Rich video and transcripts.