2008 FastLogisticRegressionforTextCa

(Ifrim et al., 2008) ⇒ Georgiana Ifrim, Gökhan Bakir, and Gerhard Weikum. (2008). “Fast Logistic Regression for Text Categorization with Variable-length N-grams.” In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2008). doi:10.1145/1401890.1401936

Subject Headings:

Notes

Cited By

Quotes

Author Keywords

Abstract

A common representation used in text categorization is the bag-of-words model (a.k.a. the unigram model). Learning with this particular representation typically involves some preprocessing, e.g., stopword removal and stemming. This results in one explicit tokenization of the corpus. In this work, we introduce a logistic regression approach where learning involves automatic tokenization. This allows us to weaken the a priori required knowledge about the corpus and results in a tokenization with variable-length (word or character) n-grams as basic tokens. We accomplish this by solving logistic regression using gradient ascent in the space of all n-grams. We show that this can be done very efficiently using a branch-and-bound approach which chooses the maximum gradient ascent direction projected onto a single dimension (i.e., candidate feature). Although the space is very large, our method allows us to investigate variable-length n-gram learning. We demonstrate the efficiency of our approach compared to state-of-the-art classifiers used for text categorization such as cyclic coordinate descent logistic regression and support vector machines.
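
The core idea in the abstract, gradient ascent that at each step moves along the single n-gram coordinate with the largest gradient magnitude, can be illustrated with a small sketch. This is not the authors' implementation: it enumerates every n-gram up to a fixed length exhaustively instead of using the branch-and-bound search the abstract describes, it assumes binary "n-gram occurs in document" features, and it uses a fixed step size. The names `ngrams`, `train`, `predict`, `max_n`, `steps`, and `step_size` are all hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def ngrams(tokens, max_n):
    """Return the set of all contiguous n-grams (as tuples) of length 1..max_n."""
    return {tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(docs, labels, max_n=3, steps=50, step_size=0.5):
    """Greedy coordinate-wise gradient ascent on the logistic log-likelihood.

    docs: list of token lists; labels: list of 0/1 class labels.
    Assumption: binary features and exhaustive n-gram enumeration
    (no branch-and-bound pruning), so this only scales to toy data.
    """
    feats = [ngrams(d, max_n) for d in docs]

    # Inverted index: n-gram -> indices of the documents that contain it.
    index = defaultdict(list)
    for i, f in enumerate(feats):
        for g in f:
            index[g].append(i)

    weights = {}
    for _ in range(steps):
        # Residual y_i - p_i for every document under the current weights.
        resid = [y - sigmoid(sum(weights.get(g, 0.0) for g in f))
                 for f, y in zip(feats, labels)]

        # For a binary feature, the partial derivative of the log-likelihood
        # is the sum of residuals over the documents containing the n-gram.
        best, best_grad = None, 0.0
        for g, idxs in index.items():
            grad = sum(resid[i] for i in idxs)
            if abs(grad) > abs(best_grad):
                best, best_grad = g, grad

        if best is None:  # every gradient is zero: nothing left to improve
            break
        # Move only along the single coordinate with the largest gradient.
        weights[best] = weights.get(best, 0.0) + step_size * best_grad
    return weights


def predict(weights, tokens, max_n=3):
    """Probability of the positive class for a tokenized document."""
    return sigmoid(sum(weights.get(g, 0.0) for g in ngrams(tokens, max_n)))


if __name__ == "__main__":
    docs = [s.split() for s in ("this movie is good", "a good film",
                                "this movie is bad", "a bad film")]
    labels = [1, 1, 0, 0]
    w = train(docs, labels)
    print(sorted(w.items(), key=lambda kv: -abs(kv[1])))
    print(predict(w, "a good story".split()))
```

Because only the selected coordinate receives a nonzero weight at each step, the learned model stays sparse even though the candidate feature space is large. In this toy version the cost of each step is a full scan over all enumerated n-grams, which is exactly the cost a pruned search over the n-gram space would aim to avoid.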

References


	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2008 FastLogisticRegressionforTextCa	Gerhard Weikum; Georgiana Ifrim; Gökhan Bakir			Fast Logistic Regression for Text Categorization with Variable-length N-grams				10.1145/1401890.1401936