Text Dataset
A text dataset is an unstructured dataset that contains text items.
- AKA: Unstructured Textual Data.
- Context:
- It can be the Input to a Text Analysis Task.
- It can range from being a Small Text Dataset to being a Large Text Dataset (such as Big NLP data).
- Example(s):
- a Text File.
- a Text Corpus (intentionally organized for research).
- an Archive File composed of Text Items.
- a Clinical Text Dataset.
- a Newswire Text Dataset.
- a Web-based Text Dataset.
- …
- Counter-Example(s):
- See: Written Message, Reading Comprehension Dataset, Question-Answer Dataset, Semantic Word Relatedness Dataset, Language Modeling Dataset, Machine Learning Dataset, Benchmark Dataset.
References
2008
- (Jijkoun et al., 2008) ⇒ Valentin Jijkoun, Mahboob Alam Khalid, Maarten Marx, and Maarten de Rijke. (2008). “Named Entity Normalization in User Generated Content.” In: Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data (AND 2008). doi:10.1145/1390749.1390755
- QUOTE: We consider the NEN (named entity normalization) task within the setting of user generated content (UGC), such as blogs, discussion forums, or comments left behind by readers of online documents. For this type of textual data, the NEN task is particularly important within the settings of media and reputation analysis (which motivated the work reported here) and of intelligence gathering.
- (Sarawagi, 2008) ⇒ Sunita Sarawagi. (2008). “Information Extraction.” In: Foundations and Trends in Databases, 1(3). doi:10.1561/1900000003
- QUOTE: The automatic extraction of information from unstructured sources has opened up new avenues for querying, organizing, and analyzing data by drawing upon the clean semantics of structured databases and the abundance of unstructured data. ... As society became more data oriented with easy online access to both structured and unstructured data, new applications of structure extraction came around.