2008 BoostingSupVectMachforImbDataSets
- (Wang & Japkowicz, 2008) ⇒ Benjamin X. Wang, and Nathalie Japkowicz. (2008). "Boosting Support Vector Machines for Imbalanced Data Sets." In: Proceedings of the 17th International Conference on Foundations of Intelligent Systems. ISBN:3-540-68122-1, 978-3-540-68122-9.
Subject Headings:
Notes
Cited By
Quotes
Abstract
Real world data mining applications must address the issue of learning from imbalanced data sets. The problem occurs when the number of instances in one class greatly outnumbers the number of instances in the other class. Such data sets often cause a default classifier to be built due to skewed vector spaces or lack of information. Common approaches for dealing with the class imbalance problem involve modifying the data distribution or modifying the classifier. In this work, we choose to use a combination of both approaches. We use support vector machines with soft margins as the base classifier to solve the skewed vector spaces problem. Then we use a boosting algorithm to get an ensemble classifier that has lower error than a single classifier. We found that this ensemble of SVMs makes an impressive improvement in prediction performance, not only for the majority class, but also for the minority class.
References
- Akbani, R., Kwek, S., and Japkowicz, N. (2004). Applying Support Vector Machines to Imbalanced Datasets. In: Proceedings of the 2004 European Conference on Machine Learning (ECML 2004).
- Amari, S., and Wu, S. (1999). Improving Support Vector Machine Classifiers by Modifying Kernel Functions. Neural Networks, 12, 783-789.
- Chawla, N., Bowyer, K., Hall, L., and Kegelmeyer, W. P. (2000). SMOTE: Synthetic Minority Over-sampling Technique. International Conference on Knowledge Based Computer Systems.
- Chawla, N., Lazarevic, A., Hall, L., and Bowyer, K. (2003). SMOTEBoost: Improving Prediction of the Minority Class in Boosting. In: Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia, 107-119.
- Chen, C., Liaw, A., and Breiman, L. (2004). Using Random Forest to Learn Unbalanced Data. Technical Report 666, Statistics Department, University of California at Berkeley.
- Fan, W., Stolfo, S., Zhang, J., and Chan, P. (1999). AdaCost: Misclassification Cost-Sensitive Boosting. In: Proceedings of the 16th International Conference on Machine Learning, Slovenia.
- Guo, H., and Viktor, H. L. (2004). Learning from Imbalanced Data Sets with Boosting and Data Generation: The DataBoost-IM Approach. ACM SIGKDD Explorations, 6(1), 30-39.
- Japkowicz, N., and Stephen, S. (2002). The Class Imbalance Problem: A Systematic Study. Intelligent Data Analysis, 6(5), 429-450.
- Kubat, M., and Matwin, S. (1997). Addressing the Curse of Imbalanced Training Sets: One-Sided Selection. In: Proceedings of the Fourteenth International Conference on Machine Learning, 179-186.
- Morik, K., Brockhausen, P., and Joachims, T. (1999). Combining Statistical Learning with a Knowledge-Based Approach - A Case Study in Intensive Care Monitoring. In: Proceedings of the 16th International Conference on Machine Learning (ICML 1999), 268-277.
- Shawe-Taylor, J., and Cristianini, N. (1999). Further Results on the Margin Distribution. In: Proceedings of the 12th Conference on Computational Learning Theory.
- Vapnik, V. (1995). The Nature of Statistical Learning Theory. New York: Springer.
- Veropoulos, K., Campbell, C., and Cristianini, N. (1999). Controlling the Sensitivity of Support Vector Machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, 55-60.
- Wu, G., and Chang, E. (2003). Adaptive Feature-Space Conformal Transformation for Imbalanced Data Learning. In: Proceedings of the 20th International Conference on Machine Learning.