Text-Item Classification Task
A text-item classification task is a linguistic classification task whose input is a text item and whose class set is a document category set.
- Context:
- Input: a Text Item Set.
- optional: an Integer of the number of text categories to return.
- optional: a Text Item Classifier.
- optional: a Text Corpus (such as an Annotated Text Corpus)
- output: a Text Category Set.
- performance measures: (see: Classification Task Performance Measure).
- It can be supported by a Text-Item Classification System (that implements a text-item categorization algorithm).
- ...
- It can range from being a Binary Text Classification Task to being a Multiclass Text Classification Task, depending on the text category set cardinality.
- It can range from being a Single-Label Text Classification Task to being a Multi-Label Text Classification Task, depending on whether each text item can belong to one or multiple categories.
- It can range from being a Small Text Item Classification Task to being a Large Text Item Classification Task (such as a text document classification), depending on the length and complexity of the text item.
- It can range from being a Manual Text Item Classification Task to being an Automated Text Item Classification Task, depending on the extent of human involvement in the classification process.
- It can range from being a Heuristic Text Classification Task to being a Data-Driven Text Classification Task, depending on whether rule-based or statistical methods are used.
- ...
- Example(s):
- a Simple Binary Classification Task, such as: spam email classification
- a Multiclass Classification Task, such as: webpage type classification and newswire topic classification.
- a Multilabel Classification Task, such as text sentiment classification or topic tagging.
- a Content Type-Specific Text-Item Classification Task, such as webpage type classification or contract classification.
- an Application-Specific Text-Item Classification Task, such as product review classification, spam email classification, or newswire topic classification.
- a Small Text-Item Classification Task, such as sentence classification or phrase classification.
- a Large Text-Item Classification Task, such as text document classification or full report classification.
- a Time-Sensitive Classification Task, such as newswire topic classification or real-time sentiment classification.
- a Fine-Grained Classification Task, such as terminological term classification or medical term classification.
- …
- Counter-Example(s):
- a Text Token Sequence Tagging Task, such as POS Tagging.
- a Text Segmentation Task, such as Text Chunking.
- a Text Segment Classification Task, such as Named Entity Mention Recognition.
- a Topic Modeling Task.
- a Document Clustering Task.
- See: NLP Task, Semi-Supervised Text Processing, Text Visualization.
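The input/output contract listed under Context can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names `classify_text_items`, `keyword_scorer`, and `KEYWORD_RULES` are hypothetical, and the keyword rules stand in for any heuristic (rule-based) text-item classifier. The `top_k` argument plays the role of the optional integer number of text categories to return.

```python
from typing import Callable, Dict, List

# Hypothetical keyword rules; a real rule set would be task-specific.
KEYWORD_RULES: Dict[str, List[str]] = {
    "spam": ["free", "winner", "prize"],
    "ham": ["meeting", "report", "schedule"],
}


def keyword_scorer(text: str) -> Dict[str, float]:
    """Heuristic classifier: score each category by its keyword hits."""
    tokens = text.lower().split()
    return {
        category: float(sum(tokens.count(word) for word in words))
        for category, words in KEYWORD_RULES.items()
    }


def classify_text_items(
    items: List[str],
    classifier: Callable[[str], Dict[str, float]] = keyword_scorer,
    top_k: int = 1,
) -> List[List[str]]:
    """Return the top_k highest-scoring categories for each text item."""
    results = []
    for text in items:
        scores = classifier(text)
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda c: (-scores[c], c))
        results.append(ranked[:top_k])
    return results


predictions = classify_text_items(
    ["You are a winner claim your free prize", "Schedule for the project meeting"]
)
print(predictions)
```

Passing a different `classifier` callable (for example, one learned from an annotated text corpus) would turn the same task interface from a heuristic into a data-driven one.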
References
2011
- (Mladenić, Brank, & Grobelnik, 2011) ⇒ Dunja Mladenić, Janez Brank, and Marko Grobelnik. (2011). “Document Classification.” In: (Sammut & Webb, 2011) p.289
2007
- (Thet et al., 2007) ⇒ Tun Thura Thet, Jin-Cheon Na, and Christopher S. G. Khoo. (2007). “Filtering Product Reviews from Web Search Results.” In: Proceedings of the 2007 ACM symposium on Document Engineering.
- NOTES: It compares the performance of a Supervised Learning Algorithm and a Heuristic Approach to a Text Categorization Task that is based on Search Snippets.
- NOTES: The Search Snippets are from Google queries using the format “[product name] review”.
2006
- (Ruch, 2006) ⇒ Patrick Ruch. (2006). “Automatic Assignment of Biomedical Categories: toward a generic approach.” In: Bioinformatics, 2006 Mar 15. doi:10.1093/bioinformatics/bti783.
- QUOTE: To our knowledge the largest set of categories ever used by text classification systems has an order of magnitude of 10^4. Thus, Yang and Chute (1992) work with the International Classification of Diseases (about 12,000 concepts), while Yang (1999) and Wilbur and Yang (1996) report on experiments conducted with a search space of less than 18,000 Medical Subject Headings (MeSH). To evaluate our system, it is tested using two different benchmarks: 1) the OHSUGEN (Hersh, 2005) collection for the MeSH terminology and 2) the BioCreative data for the Gene Ontology (GO).
2002
- (Sebastiani, 2002) ⇒ Fabrizio Sebastiani. (2002). “Machine Learning in Automated Text Categorization.” In: Association of Computing Machinery Computing Surveys (CSUR), 34(1).
- QUOTE: The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories.
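- EXAMPLE: The inductive process described in the quote (building a classifier from preclassified documents) can be illustrated with a small multinomial Naive Bayes learner. This is one possible technique chosen for illustration; Sebastiani (2002) surveys many others. The function names and the four-document training set below are made up for the sketch.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Learn per-category word counts from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for text, category in labeled_docs:
        tokens = text.lower().split()
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary


def predict(model, text):
    """Return the category with the highest log-posterior score."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in sorted(doc_counts):
        # Log prior: fraction of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[category][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Made-up preclassified documents, for illustration only.
training_set = [
    ("stocks fell as markets closed", "finance"),
    ("the bank raised interest rates", "finance"),
    ("the team won the final match", "sports"),
    ("the striker scored two goals", "sports"),
]
model = train_naive_bayes(training_set)
label = predict(model, "the team scored late goals")
print(label)
```

Whitespace tokenization and add-one smoothing are simplifying assumptions; a practical system would typically use richer text representations and a held-out set for evaluation.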
1999a
- (Yang, 1999) ⇒ Y. Yang. (1999). “An Evaluation of Statistical Approaches to Text Categorization.” In: Journal of Information Retrieval, 1.
- NOTE: It experiments on a search space of ~18,000 Medical Subject Headings (MeSH).
1999b
- (McCallum, 1999) ⇒ Andrew McCallum. (1999). “Multi-label Text Classification with a Mixture Model Trained by EM.” In: AAAI 99 Workshop on Text Learning.
- QUOTE: In many important document classification tasks, documents may each be associated with multiple class labels. ... Text classification is the problem of assigning a text document into one or more topic categories or classes. In multiclass document classification, as distinguished from binary document classification, there are more than two classes. In multi-label classification each document may have more than one class label. For example, given classes N. America, S. America, Europe, Asia and Australia, a news article about U.S. troops in Bosnia may be labeled with both the N. America and Europe classes.
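- EXAMPLE: The distinction drawn in the quote between single-label and multi-label classification can be sketched with the quote's own region classes. The sketch makes one independent yes/no decision per class from hypothetical cue words; it is not the EM-trained mixture model of McCallum (1999), and `REGION_CUES` is invented for illustration.

```python
# Hypothetical cue words per region; not taken from McCallum (1999).
REGION_CUES = {
    "N. America": {"u.s.", "canada", "mexico"},
    "S. America": {"brazil", "argentina", "chile"},
    "Europe": {"bosnia", "france", "germany"},
    "Asia": {"china", "japan", "india"},
    "Australia": {"australia", "sydney"},
}


def multi_label_classify(text):
    """Multi-label: keep every class whose cue words appear in the text."""
    tokens = set(text.lower().split())
    return [region for region, cues in REGION_CUES.items() if tokens & cues]


def single_label_classify(text):
    """Single-label: force exactly one class, the one with the most cue hits."""
    tokens = text.lower().split()
    hits = {
        region: sum(token in cues for token in tokens)
        for region, cues in REGION_CUES.items()
    }
    # max() returns the first class in dictionary order when hits are tied.
    return max(hits, key=hits.get)


article = "U.S. troops arrive in Bosnia"
labels = multi_label_classify(article)
print(labels)
print(single_label_classify(article))
```

The multi-label variant can return zero, one, or several classes per text item, whereas the single-label variant must pick exactly one and so loses one of the two regions for this article.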
1998
- (Dumais et al., 1998) ⇒ Susan T. Dumais, John C. Platt, David Heckerman, and Mehran Sahami. (1998). “Inductive Learning Algorithms and Representations for Text Categorization.” In: Proceedings of the Seventh International Conference on Information and Knowledge Management (CIKM 1998).
- QUOTE: Text categorization – the assignment of natural language texts to one or more predefined categories based on their content – is an important component in many information organization and management tasks. Its most widespread application to date has been for assigning subject categories to documents to support text retrieval, routing and filtering.
1996
- (Wilbur & Yang, 1996) ⇒ J. Wilbur, and Y. Yang. (1996). “Analysis of Statistical Term Strength and its Use in the Indexing and Retrieval of Molecular Biology Texts.” In: Comput. Biol. Med., 26(3), 209–222.
- NOTE: It experiments on a search space of fewer than 18,000 Medical Subject Headings (MeSH).
1992
- (Yang & Chute, 1992) ⇒ Y. Yang, and C. Chute. (1992). “A Linear Least Squares Fit Mapping Method for Information Retrieval from Natural Language Texts.” In: COLING 1992.
- NOTE: It works with the International Classification of Diseases (about 12,000 concepts).
1975
- (Field, 1975) ⇒ B. J. Field. (1975). “Towards Automatic Indexing: Automatic assignment of controlled-language indexing and classification from free indexing.” In: Journal of Documentation, 31(4). doi:10.1108/eb026605
1963
- (Borko & Bernick, 1963) ⇒ Harold Borko, and Myrna Bernick. (1963). “Automatic Document Classification.” In: Journal of the ACM (JACM).
- QUOTE: The problem of automatic document classification is a part of the larger problem of automatic content analysis. Classification means the determination of subject content. For a document to be classified under a given heading, it must be ascertained that its subject matter relates to that area of discourse. In most cases this is a relatively easy decision for a human being to make. The question being raised is whether a computer can be programmed to determine the subject content of a document and the category (categories) into which it should be classified.