Text Corpus

AKA: Unstructured Document Collection.
Context:
- It can range from being a Very Large Text Corpus, to being a Large Text Corpus to being a Small Text Corpus.
- It can range from being a Monolingual Text Corpus (such as an English corpus), to being a Bilingual Text Corpus to being a Multilingual Text Corpus.
- It can be an Annotated Text Corpus.
- It can be an input to a Text Corpus Mining Task.
Example(s):
Counter-Example(s):
- a Genetic Dataset.
- an Audio File Set.
- an Image File Set.
- a Multimedia Document Corpus.
See: Corpora, Text-based Semantic Annotation, Text Stream.

References

(Wikipedia, 2023) ⇒ https://en.wikipedia.org/wiki/Text_corpus Retrieved:2023-7-24.
- In linguistics, a corpus (plural corpora) or text corpus is a language resource consisting of a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, they are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.
  In search technology, a corpus is the collection of documents which is being searched.

(Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/list_of_text_corpora Retrieved:2015-4-13.
- Following is a list of text corpora in various languages. “Text corpora" is the plural of "text corpus". A text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). Text corpora are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.

↑ Professor Mark Davies at BYU created an online tool to search Google's English language corpus, drawn from Google Books, at http://googlebooks.byu.edu/x.asp.

(Yao et al., 2009) ⇒ Limin Yao, David Mimno, and Andrew McCallum. (2009). “Efficient Methods for Topic Model Inference on Streaming Document Collections.” In: Proceedings of ACM SIGKDD Conference (KDD-2009). 10.1145/1557019.1557121
- QUOTE: Topic models provide a powerful tool for analyzing large text collections by ... Fitting a topic model given a set of training documents requires ... With today's large-scale, constantly expanding document collections, it is useful to be able to infer topic distributions for new documents without retraining the model.