Text Dataset

AKA: Unstructured Textual Data.
Context:
- It can be the Input to a Text Analysis Task.
- It can range from being a Small Text Dataset to being a Large Text Dataset (such as Big NLP data).
Example(s):
- a Text File.
- a Text Corpus (intentionally organized for research).
- an Archive File composed of Text Items.
- a Clinical Text Dataset.
- a Newswire Text Dataset.
- a Web-based Text Dataset.
- …
Counter-Example(s):
- an Audio Dataset.
- a Video Dataset.
See: Written Message, Reading Comprehension Dataset, Question-Answer Dataset, Semantic Word Relatedness Dataset, Language Modeling Dataset, Machine Learning Dataset, Benchmark Dataset.
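Illustration: a minimal Python sketch of a small Text Dataset (a directory of Text Files) serving as the Input to a simple Text Analysis Task (term counting). The function names `load_text_dataset` and `term_frequencies`, the flat directory of UTF-8 `.txt` files, and the regular-expression tokenizer are all assumptions made for this example and are not part of the concept definition.

```python
from collections import Counter
from pathlib import Path
import re


def load_text_dataset(directory):
    """Read every .txt file directly under `directory` into a {filename: text} dict.

    Hypothetical helper: assumes a flat directory of UTF-8 text files.
    """
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(Path(directory).glob("*.txt"))
    }


def term_frequencies(dataset):
    """Count lowercase word tokens across all text items in the dataset.

    The tokenizer is a deliberately crude assumption (letters, digits, apostrophes).
    """
    counts = Counter()
    for text in dataset.values():
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    return counts
```

A call such as `term_frequencies(load_text_dataset("my_corpus"))` would return a `Counter` mapping each token to its frequency, where `my_corpus` is a placeholder directory name. A Large Text Dataset would likely need streaming or sharded reads rather than loading every file into memory.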

References

(Jijkoun et al., 2008) ⇒ Valentin Jijkoun, Mahboob Alam Khalid, Maarten Marx, and Maarten de Rijke. (2008). “Named Entity Normalization in User Generated Content.” In: Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data (AND 2008). doi:10.1145/1390749.1390755
- QUOTE: We consider the NEN (named entity normalization) task within the setting of user generated content (UGC), such as blogs, discussion forums, or comments left behind by readers of online documents. For this type of textual data, the NEN task is particularly important within the settings of media and reputation analysis (which motivated the work reported here) and of intelligence gathering.