Text Topic Modeling Task

Context:
- Input: a Text Corpus.
  - optional: A Document Topic Annotated Corpus (a type of annotated corpus).
  - optional: A number of document topics sought: k.
  - optional: One or more time interval(s): t.
- Output: a document topic set.
  - optional: a Document Topic Model that can be used to classify a new document to one or more of the document topics.
- It can range from being a Text-only Topic Modeling Task to being
- It can be solved by a Text Topic Modeling System (that implements a text topic modeling algorithm).
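
The input/output contract above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a toy collapsed Gibbs sampler for an LDA-style model; the function names `fit_topics` and `classify`, the hyperparameters `alpha` and `beta`, and the four-document corpus are hypothetical and do not come from any particular system. The optional annotated corpus and time intervals are not used here.

```python
import math
import random
from collections import Counter


def fit_topics(corpus, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for an LDA-style model (illustrative only).

    Returns (topic_word, doc_topic): k word distributions (the document
    topic set) and one topic-proportion vector per document.
    """
    rng = random.Random(seed)
    docs = [doc.lower().split() for doc in corpus]
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    # Random initial topic assignment for every token.
    z = [[rng.randrange(k) for _ in d] for d in docs]
    n_dk = [[0] * k for _ in docs]        # topic counts per document
    n_kw = [Counter() for _ in range(k)]  # word counts per topic
    n_k = [0] * k                         # total tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_dk[d][t] += 1
            n_kw[t][w] += 1
            n_k[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment, then resample it.
                t = z[d][i]
                n_dk[d][t] -= 1
                n_kw[t][w] -= 1
                n_k[t] -= 1
                weights = [
                    (n_dk[d][j] + alpha) * (n_kw[j][w] + beta) / (n_k[j] + V * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                n_dk[d][t] += 1
                n_kw[t][w] += 1
                n_k[t] += 1
    topic_word = [
        {w: (n_kw[j][w] + beta) / (n_k[j] + V * beta) for w in vocab}
        for j in range(k)
    ]
    doc_topic = [
        [(n_dk[d][j] + alpha) / (len(doc) + k * alpha) for j in range(k)]
        for d, doc in enumerate(docs)
    ]
    return topic_word, doc_topic


def classify(doc, topic_word):
    """Assign a new document to the single most likely topic.

    A simplification: scores each topic by the log-likelihood of the
    document's known words and ignores out-of-vocabulary words.
    """
    words = [w for w in doc.lower().split() if w in topic_word[0]]
    scores = [sum(math.log(t[w]) for w in words) for t in topic_word]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical four-document corpus, with k = 2 topics sought.
corpus = [
    "dog bone dog bark",
    "dog bark bone bone",
    "cat meow cat purr",
    "cat purr meow meow",
]
topic_word, doc_topic = fit_topics(corpus, k=2)
new_doc_topic = classify("dog bark", topic_word)
```

Here `topic_word` plays the role of the document topic set, and `topic_word` together with `classify` stands in for the optional document topic model. A production system would typically rely on an established library rather than a hand-written sampler.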
Example(s):
- TDT-2 Benchmark Task.
- …
Counter-Example(s):
- a Topic Modeling with Network Structure Task.
- an Author Topic Modeling Task.
- a Text Categorization Task.
- a Topic Tracking Task/Trend Detection in Text.
- a Topic Modeling Algorithm (can solve the task).
- an Opinion Mining Task.
- a Language Modeling Task.
See: Topic Analysis Task, Text Mining, Document Modeling.

References

(Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Topic_model Retrieved:2015-4-25.
- In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.
  Although topic models were first described and implemented in the context of natural language processing, they have applications in other fields such as bioinformatics.
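
The mixture intuition in the passage above can be checked with simple arithmetic. The sketch below assumes two hypothetical topics with made-up word probabilities (the numbers are illustrative, not taken from the source) and computes the expected word probabilities of a document that is 90% about dogs and 10% about cats.

```python
# Hypothetical per-topic word distributions (illustrative numbers only).
topics = {
    "dogs": {"dog": 0.4, "bone": 0.3, "the": 0.2, "is": 0.1},
    "cats": {"cat": 0.4, "meow": 0.3, "the": 0.2, "is": 0.1},
}


def expected_word_probs(topic_mix, topics):
    """Mix per-topic word distributions by the document's topic proportions."""
    probs = {}
    for topic, weight in topic_mix.items():
        for word, p in topics[topic].items():
            probs[word] = probs.get(word, 0.0) + weight * p
    return probs


doc_mix = {"dogs": 0.9, "cats": 0.1}
probs = expected_word_probs(doc_mix, topics)

dog_mass = probs["dog"] + probs["bone"]   # probability of topic-specific dog words
cat_mass = probs["cat"] + probs["meow"]   # probability of topic-specific cat words
ratio = dog_mass / cat_mass               # about 9: nine times more dog words
```

Under these assumed distributions, function words such as "the" keep the same probability (0.2) regardless of the topic mix, while the topic-specific words scale with the document's topic proportions.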

(Blei, 2008) ⇒ David M. Blei. (2008). “Modeling Science.” Presentation. April 17, 2008.

(Newman et al., 2006) ⇒ David Newman, Chaitanya Chemudugunta, and Padhraic Smyth. (2006). “Statistical Entity-Topic Models.” In: Proceedings of SIGKDD-2006.

(Griffiths & Steyvers, 2004) ⇒ Thomas L. Griffiths, and Mark Steyvers. (2004). “Finding Scientific Topics.” In: PNAS, 101(Suppl. 1). doi:10.1073/pnas.0307752101
- A first step in identifying the content of a document is determining which topics that document addresses.

(Blei, Ng & Jordan, 2003) ⇒ David M. Blei, Andrew Y. Ng, and Michael I. Jordan. (2003). “Latent Dirichlet Allocation.” In: The Journal of Machine Learning Research, 3.
- In this paper we consider the problem of modeling text corpora and other collections of discrete data. The goal is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, novelty detection, summarization, and similarity and relevance judgments.

(Liu et al., 2003) ⇒ Bing Liu, Chee Wee Chin, and Hwee Tou Ng. (2003). “Mining Topic-Specific Concepts and Definitions on the Web.” In: Proceedings of the 12th International Conference on World Wide Web (WWW 2003). doi:10.1145/775152.775188

(Allan, 2002) ⇒ James Allan. (2002). “Introduction to Topic Detection and Tracking.” In: James Allan, editor. “Topic Detection and Tracking: Event-based Information Organization.” Kluwer International. ISBN:0792376641

(Allan et al., 1998) ⇒ James Allan, Jaime Carbonell, George Doddington, Jonathan Yamron, and Yiming Yang. (1998). “Topic Detection and Tracking Pilot Study: Final Report.” In: Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop.