Text Corpus
A text corpus is a corpus composed of text items.
- AKA: Unstructured Document Collection.
- Context:
- It can range from being a Very Large Text Corpus, to being a Large Text Corpus, to being a Small Text Corpus.
- It can range from being a Monolingual Text Corpus (such as an English corpus), to being a Bilingual Text Corpus, to being a Multilingual Text Corpus.
- It can be an Annotated Text Corpus.
- It can be an input to a Text Corpus Mining Task.
- Example(s):
- Counter-Example(s):
- See: Corpora, Text-based Semantic Annotation, Text Stream.
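As a rough illustration of the definition above, a text corpus can be sketched as a collection of text items, and a very simple text corpus mining task as counting term occurrences across it. The toy documents and the function names (`tokenize`, `term_frequencies`, `document_frequency`) are hypothetical and chosen only for this sketch; real corpora are far larger and usually need more careful tokenization.

```python
from collections import Counter
import re

# Hypothetical toy corpus: each text item is one document.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A corpus is a collection of texts.",
]

def tokenize(text):
    """Lowercase a text item and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_frequencies(documents):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts

def document_frequency(documents, term):
    """Count how many documents contain the given term at least once."""
    return sum(1 for doc in documents if term in set(tokenize(doc)))

freqs = term_frequencies(corpus)
print(freqs["the"])                       # occurrences of "the" in the corpus
print(document_frequency(corpus, "cat"))  # documents that mention "cat"
```

Keeping the corpus as a plain list of strings keeps the sketch self-contained; an annotated or multilingual corpus would need a richer record per text item.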
References
2023
- (Wikipedia, 2023) ⇒ https://en.wikipedia.org/wiki/Text_corpus Retrieved:2023-7-24.
- In linguistics, a corpus (plural corpora) or text corpus is a language resource consisting of a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, they are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.
In search technology, a corpus is the collection of documents which is being searched.
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/list_of_text_corpora Retrieved:2015-4-13.
- Following is a list of text corpora in various languages. "Text corpora" is the plural of "text corpus". A text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). Text corpora are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/list_of_text_corpora#English_language Retrieved:2015-4-13.
- Google N-Grams Corpus – Largest English corpus at 155 billion words. [1] Also has corpora for other languages. To download datasets of this corpus, see
- American National Corpus.
- Bank of English.
- British National Corpus.
- Corpus Juris Secundum.
- Corpus of Contemporary American English (COCA) 425 million words, 1990–2011. Freely searchable online.
- Brown Corpus, forming part of the "Brown Family" of corpora, together with LOB, Frown and F-LOB.
- International Corpus of English.
- Oxford English Corpus.
- Scottish Corpus of Texts & Speech.
- Corpus Resource Database (CoRD), more than 80 English language corpora.
- ↑ Professor Mark Davies at BYU created an online tool to search Google's English language corpus, drawn from Google Books, at http://googlebooks.byu.edu/x.asp.
2009
- (Yao et al., 2009) ⇒ Limin Yao, David Mimno, and Andrew McCallum. (2009). “Efficient Methods for Topic Model Inference on Streaming Document Collections.” In: Proceedings of ACM SIGKDD Conference (KDD-2009). doi:10.1145/1557019.1557121
- QUOTE: Topic models provide a powerful tool for analyzing large text collections by ... Fitting a topic model given a set of training documents requires ... With today's large-scale, constantly expanding document collections, it is useful to be able to infer topic distributions for new documents without retraining the model.