Document Classification Task
A Document Classification Task is a text item classification task whose input is a document and whose output is a document category from a document category set.
- AKA: Document Categorization Task, Document Sorting Task.
- Context:
- Task Input:
- A Document for classification processing.
- A Document Category Set for category assignment.
- Optional Classification Parameters for process control.
- Task Output:
- A Document Category assignment.
- Classification Confidence Scores for result evaluation.
- Task Measure: Classification Accuracy, Processing Speed, Confidence Level.
- ...
- It can (typically) analyze Document Content for category determination.
- It can (typically) utilize Document Features for classification decision.
- It can (typically) apply Classification Rules for category assignment.
- It can (often) use Document Metadata for classification enhancement.
- It can (often) support Batch Processing for multiple documents.
- ...
- It can range from being an Unstructured Document Classification Task to being a Structured Document Classification Task, depending on document structure.
- It can range from being a Human-Performed Document Classification Task to being an Automated Document Classification Task, depending on automation level.
- It can range from being a Single-Label Classification to being a Multi-Label Classification, depending on label cardinality.
- It can range from being a Simple Classification Task to being a Complex Classification Task, depending on task complexity.
- ...
- It can be supported by a Document Classification System (that implements a Document Classification Algorithm).
- It can be supported by a Document Index Creation Task.
- It can be supported by a Document Feature Extraction System.
- ...
- Examples:
- Content Classification Tasks:
- Email Filtering Tasks, such as spam classification (for email filtering) and priority classification (for email routing).
- News Classification Tasks, such as topic classification (for news organization) and AG News (for news categorization).
- Professional Classification Tasks:
- Legal Document Classification Tasks, such as contract classification (for document routing) and case classification (for legal processing).
- Medical Document Classification Tasks, such as patient record classification (for record organization) and diagnosis classification (for medical coding).
- Research Classification Tasks:
- Academic Paper Classifications, such as subject classification (for paper organization) and Reuters 90 (for benchmark testing).
- ...
- Counter-Examples:
- Image Classification Task, which processes visual content rather than textual content.
- Document Search Task, which retrieves rather than categorizes.
- Document Generation Task, which creates rather than classifies.
- Text Clustering Task, which groups without predefined categories.
- See: Document Semantic Parsing Task, Text Classification, Content Categorization, Document Organization.
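The task inputs, outputs, and measures listed under Context can be illustrated with a minimal sketch. It assumes a toy keyword-overlap scorer; the names `KEYWORD_RULES`, `ClassificationResult`, `classify`, and `accuracy`, the three categories, and the 0.3 threshold are all hypothetical and chosen only for illustration. It is not a reference implementation of any particular Document Classification System.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class ClassificationResult:
    """Task output: assigned categories plus a confidence score per category."""
    categories: List[str]     # one entry if single-label, zero or more if multi-label
    scores: Dict[str, float]  # score for every category in the category set


# Hypothetical Document Category Set with illustrative classification rules:
# each category maps to a set of indicative terms.
KEYWORD_RULES: Dict[str, Set[str]] = {
    "spam": {"winner", "prize", "free"},
    "legal": {"contract", "clause", "liability"},
    "medical": {"patient", "diagnosis", "dosage"},
}


def score_document(document: str, rules: Dict[str, Set[str]]) -> Dict[str, float]:
    """Score each category by the fraction of its terms found in the document."""
    tokens = set(re.findall(r"[a-z]+", document.lower()))
    return {cat: len(tokens & terms) / len(terms) for cat, terms in rules.items()}


def classify(
    document: str,
    rules: Dict[str, Set[str]] = KEYWORD_RULES,
    multi_label: bool = False,   # optional classification parameter
    threshold: float = 0.3,      # optional classification parameter (assumed value)
) -> ClassificationResult:
    scores = score_document(document, rules)
    if multi_label:
        # Multi-label: keep every category whose score reaches the threshold.
        categories = sorted(c for c, s in scores.items() if s >= threshold)
    else:
        # Single-label: keep the top-scoring category; ties (including an
        # all-zero score) fall back to alphabetical order in this sketch.
        categories = [max(sorted(scores), key=scores.get)]
    return ClassificationResult(categories, scores)


def accuracy(predicted: List[str], expected: List[str]) -> float:
    """Classification Accuracy: fraction of documents given the expected label."""
    if not expected:
        return 0.0
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)


if __name__ == "__main__":
    single = classify("The patient diagnosis requires a new dosage.")
    print(single.categories, single.scores)

    multi = classify(
        "This contract names the patient and limits liability for dosage errors.",
        multi_label=True,
    )
    print(multi.categories)
```

Batch processing would amount to mapping `classify` over a list of documents. A Text Clustering Task, by contrast, would have no `rules` (predefined category set) argument at all.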
References
2014
- (Wikipedia, 2014) ⇒ http://en.wikipedia.org/wiki/Document_classification Retrieved:2014-10-31.
- Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is used mainly in information science and computer science. The problems are overlapping, however, and there is therefore also interdisciplinary research on document classification.
The documents to be classified may be texts, images, music, etc. Each kind of document possesses its special classification problems. When not otherwise specified, text classification is implied.
Documents may be classified according to their subjects or according to other attributes (such as document type, author, printing year etc.). In the rest of this article only subject classification is considered. There are two main philosophies of subject classification of documents: The content based approach and the request based approach.