2023 LinearClassifierAnOftenForgotte
- (Lin et al., 2023) ⇒ Yu-Chen Lin, Si-An Chen, Jie-Jyun Liu, and Chih-Jen Lin. (2023). “Linear Classifier: An Often-Forgotten Baseline for Text Classification.” In: arXiv preprint arXiv:2306.07111. doi:10.48550/arXiv.2306.07111
Subject Headings: Supervised Text Classification Algorithm.
Notes
- It emphasizes the competitive performance of linear classifiers in text classification, challenging the dominance of complex models like BERT in certain scenarios.
- It presents a comparative study of linear classifiers and advanced pre-trained language models, showcasing the efficacy of simpler methods across various text datasets.
- It reinforces the importance of using simple baselines like linear classifiers to validate the results obtained from more sophisticated machine learning models.
- It includes a detailed experimental analysis demonstrating that linear methods such as SVMs can match or even outperform BERT-based approaches on supervised text classification tasks.
- It revisits previous research on linear SVM and pre-trained language models, providing a fresh perspective on the comparative effectiveness of these methods.
- It underscores the practical implications of choosing linear classifiers, highlighting their efficiency and robustness, especially in resource-constrained environments.
- It concludes with a discussion on the balance between advanced models and simpler methods, advocating for the inclusion of linear classifiers in model evaluation processes.
- Unexpectedly, it focuses on SVMs, not Gradient Boosted Trees.
- Their feature generation best practices involve three main parameters (see the sketch after this list):
- ngram_range=(1, 3): unigram, bigram, and trigram TF-IDF features (though LibMultiLabel's default uses only unigrams).
- min_df=5: a minimum document frequency of five, i.e., tokens appearing in fewer than five documents are removed.
- max_features≈40,000: a cap of approximately 40,000 TF-IDF features, selected by term frequency.
- Dimensionality reduction is not explored.
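The snippet below is a minimal sketch of these three settings using scikit-learn's TfidfVectorizer. The paper's experiments use LibMultiLabel on LexGLUE, so the 20 Newsgroups corpus and the exact values here are illustrative assumptions only, not the authors' setup.

```python
# Hedged sketch (not the authors' code): the three TF-IDF settings noted above,
# applied with scikit-learn's TfidfVectorizer to a stand-in corpus.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in dataset; the paper uses the LexGLUE benchmark instead.
train = fetch_20newsgroups(subset="train")

vectorizer = TfidfVectorizer(
    ngram_range=(1, 3),   # unigram, bigram, and trigram features
    min_df=5,             # drop tokens that appear in fewer than 5 documents
    max_features=40000,   # keep roughly the 40,000 most frequent terms
)
X_train = vectorizer.fit_transform(train.data)  # sparse document-term matrix
print(X_train.shape)  # (n_documents, <=40000)
```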
Cited By
Quotes
Abstract
Large-scale pre-trained language models such as BERT are popular solutions for text classification. Due to the superior performance of these advanced methods, nowadays, people often directly train them for a few epochs and deploy the obtained model. In this opinion paper, we point out that this way may not always get satisfactory results. We argue the importance of running a simple baseline like linear classifiers on bag-of-words features along with advanced methods. First, for many text data, linear methods show competitive performance, high efficiency, and robustness. Second, advanced models such as BERT may only achieve the best results if properly applied. Simple baselines help to confirm whether the results of advanced models are acceptable. Our experimental results fully support these points.
Body
...
ngram_range: Specify the range of n-grams to be extracted. For example, LibMultiLabel only uses uni-gram while Chalkidis et al. (2022) set ngram_range to (1, 3) so uni-gram, bi-gram, and tri-gram are extracted into the vocabulary list for a richer representation of the document. min_df: The parameter is used for removing infrequent tokens. Chalkidis et al. (2022) remove tokens that appear in less than five documents while LibMultiLabel does not remove any tokens. max_features: The parameter decides the number of features to use by term frequency. For example, Chalkidis et al. (2022) consider the top 10,000, 20,000, and 40,000 frequent terms as the search space of the parameter ...
...
Table 6: Data statistics for LexGLUE, the benchmark considered in Chalkidis et al. (2022). W means the average # words per instance of the whole set. The # features indicates the # TF-IDF features used by linear methods ...
...
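As a rough illustration of the kind of linear baseline the quoted passage describes, the sketch below pairs a TF-IDF vectorizer with a linear SVM and searches max_features over the 10,000/20,000/40,000 space mentioned above. scikit-learn's LinearSVC and GridSearchCV and the 20 Newsgroups corpus are stand-ins chosen for this example; the paper itself relies on LibMultiLabel's linear solvers and the LexGLUE datasets.

```python
# Hedged sketch of a TF-IDF + linear SVM baseline with the max_features
# search space described in the quoted passage (not the authors' pipeline).
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

train = fetch_20newsgroups(subset="train")

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(min_df=5)),  # drop tokens seen in < 5 documents
    ("clf", LinearSVC()),                  # linear SVM classifier
])

# Search space for max_features, following the quoted 10k/20k/40k grid.
param_grid = {"tfidf__max_features": [10_000, 20_000, 40_000]}

search = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=-1)
search.fit(train.data, train.target)
print(search.best_params_, search.best_score_)
```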
References
Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year
---|---|---|---|---|---|---|---|---|---
Chih-Jen Lin, Yu-Chen Lin, Si-An Chen, Jie-Jyun Liu | | | Linear Classifier: An Often-Forgotten Baseline for Text Classification | | | | 10.48550/arXiv.2306.07111 | | 2023