1997 AComparativeStudyOnFeatureSel
- (Yang & Pedersen, 1997) ⇒ Yiming Yang, and Jan O. Pedersen. (1997). “A Comparative Study on Feature Selection in Text Categorization.” In: Proceedings of the Fourteenth International Conference on Machine Learning (ICML 1997).
Subject Headings: Feature Selection.
Notes
Cited By
- ~2,958 http://scholar.google.com/scholar?q=%22A+Comparative+Study+on+Feature+Selection+in+Text+Categorization%22+1997
- ~665 http://portal.acm.org/citation.cfm?id=657137#citedby
Quotes
Abstract
This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated, including term selection based on document frequency (DF), information gain (IG), mutual information (MI), a χ² test (CHI), and term strength (TS). We found IG and CHI most effective in our experiments. Using IG thresholding with a k-nearest neighbor classifier on the Reuters corpus, removal of up to 98% of unique terms actually yielded an improved classification accuracy (measured by average precision). DF thresholding performed similarly. Indeed, we found strong correlations between the DF, IG and CHI values of a term. This suggests that DF thresholding, the simplest method with the lowest cost in computation, can be reliably used instead of IG or CHI when the computation of these measures is too expensive. TS compares favorably with the other methods with up to 50% vocabulary reduction but is not competitive at higher vocabulary reduction levels. In contrast, MI had relatively poor performance due to its bias towards favoring rare terms, and its sensitivity to probability estimation errors.
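
To make two of the criteria named in the abstract concrete, the following is a minimal sketch of DF thresholding and of a χ² score between a term and a category. It is an illustration only, not the authors' implementation: the function names (`document_frequency`, `chi_square`, `select_by_df`) and the four-document toy corpus are hypothetical, the χ² score is the standard statistic for a 2×2 term/category contingency table, and the step of combining per-category scores into a single score per term is left out.

```python
from collections import Counter

# Toy corpus (invented for illustration): tokenized documents and one label each.
docs = [
    ["wheat", "price", "rise"],
    ["wheat", "export"],
    ["bank", "rate", "rise"],
    ["bank", "price"],
]
labels = ["grain", "grain", "money", "money"]


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df


def chi_square(docs, labels, term, category):
    """Chi-square statistic of the 2x2 table (term present?) x (in category?)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1  # term present, document in category
        elif has_term:
            b += 1  # term present, document outside category
        elif in_cat:
            c += 1  # term absent, document in category
        else:
            d += 1  # term absent, document outside category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def select_by_df(docs, keep):
    """Keep the `keep` terms with the highest DF (ties broken alphabetically)."""
    df = document_frequency(docs)
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return ranked[:keep]


print(select_by_df(docs, 4))
print(chi_square(docs, labels, "wheat", "grain"))
print(chi_square(docs, labels, "price", "grain"))
```

On this toy corpus, "wheat" occurs only in the "grain" documents and so receives a high χ² score for that category, while "price" is spread evenly across both categories and scores zero. DF needs only one counting pass and no labels, which is the low computational cost the abstract refers to.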
 | Author | title | journal | titleUrl | year |
---|---|---|---|---|---|
1997 AComparativeStudyOnFeatureSel | Yiming Yang, Jan O. Pedersen | A Comparative Study on Feature Selection in Text Categorization | Proceedings of the Fourteenth International Conference on Machine Learning | http://nyc.lti.cs.cmu.edu/yiming/Publications/icml97.ps | 1997 |