2008 IntroductionToIR
- (Manning et al., 2008) ⇒ Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. (2008). “Introduction to Information Retrieval.” Cambridge University Press. ISBN:0521865719.
Subject Headings: Information Retrieval Task, Information Retrieval Algorithm.
Notes
- Table of Contents
- 01 Boolean retrieval
- 02 The term vocabulary & postings lists
- 03 Dictionaries and tolerant retrieval
- 04 Index construction
- 05 Index compression
- 06 Scoring, term weighting & the vector space model
- 07 Computing scores in a complete search system
- 08 Evaluation in information retrieval
- 09 Relevance feedback & query expansion
- 10 XML retrieval
- 11 Probabilistic information retrieval
- 12 Language models for information retrieval
- 13 Text classification & Naive Bayes
- 14 Vector space classification
- 15 Support vector machines & machine learning on documents
- 16 Flat clustering
- 17 Hierarchical clustering
- 18 Matrix decompositions & latent semantic indexing
- 19 Web search basics
- 20 Web crawling and indexes
- 21 Link analysis
- Bibliography & Index
Quotes
The text classification problem http://nlp.stanford.edu/IR-book/html/htmledition/the-text-classification-problem-1.html
- In text classification, we are given a description $d \in \mathbb{X}$ of a document, where $\mathbb{X}$ is the document space; and a fixed set of classes $\mathbb{C} = \{ c_1, c_2, \ldots, c_J \}$. Classes are also called categories or labels. Typically, the document space $\mathbb{X}$ is some type of high-dimensional space, and the classes are human defined for the needs of an application, as in the examples China and documents that talk about multicore computer chips above. We are given a training set $\mathbb{D}$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in \mathbb{X} \times \mathbb{C}$. For example:
- [math]\displaystyle{ \langle d, c \rangle = \langle \text{Beijing joins the World Trade Organization}, \text{China} \rangle }[/math] (111)
for the one-sentence document Beijing joins the World Trade Organization and the class (or label) China.
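In code, such a training set is just a collection of $\langle$document, class$\rangle$ pairs. A minimal Python sketch (the variable names and the extra example documents are illustrative additions, not from the book):

```python
# Training set D: each element is a pair <d, c> in X x C.
# The first pair is the book's example; the others are invented stand-ins.
training_set = [
    ("Beijing joins the World Trade Organization", "China"),
    ("Chinese exports rise for a third straight month", "China"),
    ("New multicore chip doubles per-watt performance", "chips"),
]

docs, labels = zip(*training_set)  # documents d and their classes c
```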
Using a learning method or learning algorithm, we then wish to learn a classifier or classification function $\gamma $ that maps documents to classes:
- [math]\displaystyle{ \gamma: \mathbb{X} \rightarrow \mathbb{C} }[/math] (112)
This type of learning is called supervised learning because a supervisor (the human who defines the classes and labels training documents) serves as a teacher directing the learning process. We denote the supervised learning method by $\Gamma$ and write $\Gamma(\mathbb{D}) = \gamma$. The learning method $\Gamma$ takes the training set $\mathbb{D}$ as input and returns the learned classification function $\gamma$.
Most names for learning methods $\Gamma$ are also used for classifiers $\gamma$. We talk about the Naive Bayes (NB) learning method $\Gamma$ when we say that “Naive Bayes is robust,” meaning that it can be applied to many different learning problems and is unlikely to produce classifiers that fail catastrophically. But when we say that “Naive Bayes had an error rate of 20%,” we are describing an experiment in which a particular NB classifier $\gamma$ (which was produced by the NB learning method) had a 20% error rate in an application.
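This split between $\Gamma$ and $\gamma$ maps directly onto code: the learning method is an unfitted model, and the classifier is the fitted object it returns. A sketch using scikit-learn's multinomial Naive Bayes (scikit-learn and the tiny corpus are assumptions of this sketch, not something the book prescribes):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative training set D (invented for this sketch).
docs = ["Beijing joins the World Trade Organization",
        "Chinese exports rise for a third straight month",
        "New multicore chip doubles per-watt performance"]
labels = ["China", "China", "chips"]

# Gamma: the learning method, here an unfitted bag-of-words NB pipeline.
Gamma = make_pipeline(CountVectorizer(), MultinomialNB())

# gamma = Gamma(D): fitting the method to the training set yields the classifier.
gamma = Gamma.fit(docs, labels)

# gamma: X -> C, a function from documents to classes.
print(gamma.predict(["Beijing hosts a trade summit"]))  # e.g. ['China']
```

Note that scikit-learn fits the pipeline in place, so `Gamma` and `gamma` end up as the same object; conceptually, though, the unfitted pipeline plays the role of $\Gamma$ and the fitted one the role of $\gamma$.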
…
Our goal in text classification is high accuracy on test data or new data: for example, the newswire articles that we will encounter tomorrow morning in the multicore chip example. It is easy to achieve high accuracy on the training set (e.g., we can simply memorize the labels). But high accuracy on the training set in general does not mean that the classifier will work well on new data in an application. When we use the training set to learn a classifier for test data, we make the assumption that training data and test data are similar or from the same distribution.
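The gap between training and test accuracy is easy to make concrete. A sketch, again with scikit-learn and an invented six-document corpus, that holds out part of the data and scores the classifier on both splits:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative labeled corpus (invented for this sketch).
docs = [
    "Beijing joins the World Trade Organization",
    "Chinese exports rise for a third straight month",
    "China announces new trade tariffs",
    "New multicore chip doubles per-watt performance",
    "Chipmaker unveils eight-core processor",
    "Multicore processors dominate the server market",
]
labels = ["China", "China", "China", "chips", "chips", "chips"]

# Hold out two documents; the classifier never sees their labels.
train_docs, test_docs, train_labels, test_labels = train_test_split(
    docs, labels, test_size=2, random_state=0)

gamma = make_pipeline(CountVectorizer(), MultinomialNB()).fit(train_docs, train_labels)

# Training accuracy is optimistic (the classifier has seen these labels);
# test accuracy estimates performance on new data from the same distribution.
print("train accuracy:", gamma.score(train_docs, train_labels))
print("test accuracy: ", gamma.score(test_docs, test_labels))
```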