2010 RegularizedMarginbasedCondition
- (Jin et al., 2010) ⇒ Xiao-Bo Jin, Cheng-Lin Liu, and Xinwen Hou. (2010). “Regularized Margin-based Conditional Log-likelihood Loss for Prototype Learning.” In: Pattern Recognition Journal, 43(7). doi:10.1016/j.patcog.2010.01.013
Subject Headings: Regularization Term
Notes
- It proposes a Prototype Learning Algorithm that minimizes a conditional log-likelihood loss (CLL) called log-likelihood of margin (LOGM).
- It proposes the addition of a Regularization Term to the loss function to avoid over-fitting as well as to maximize the hypothesis margin (see the sketch after this list).
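The following is a minimal sketch, in Python, of the margin-based conditional log-likelihood idea. It assumes a GLVQ-style relative margin between the nearest genuine-class prototype and the nearest rival prototype, a logistic (sigmoid) link for the probability of correct classification, and a regularization term proportional to the distance to the genuine prototype; the exact margin form, sharpness `xi`, and regularizer weight `lam` are illustrative assumptions, not the paper's precise formulation.

```python
import numpy as np

def logm_style_loss(x, prototypes, proto_labels, y, xi=1.0, lam=0.01):
    """Hedged sketch of a log-likelihood-of-margin (LOGM) style loss.

    x            : (d,) input pattern (numpy array)
    prototypes   : (m, d) prototype vectors
    proto_labels : (m,) class label of each prototype
    y            : true class label of x
    xi, lam      : assumed sharpness and regularization weights
    """
    d2 = np.sum((prototypes - x) ** 2, axis=1)               # squared Euclidean distances
    d_genuine = d2[proto_labels == y].min()                  # nearest prototype of the correct class
    d_rival = d2[proto_labels != y].min()                    # nearest prototype of any other class
    margin = (d_rival - d_genuine) / (d_rival + d_genuine)   # relative margin in [-1, 1]
    p_correct = 1.0 / (1.0 + np.exp(-xi * margin))           # logistic model of P(correct | x)
    return -np.log(p_correct) + lam * d_genuine              # CLL term plus regularization
```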
Cited By
- http://scholar.google.com/scholar?q=%222010%22+Regularized%20Margin-based%20Conditional%20Log-likelihood%20Loss%20for%20Prototype%20Learning
- http://dl.acm.org/citation.cfm?id=1752253.1752392&preflayout=flat#citedby
Quotes
Abstract
The classification performance of nearest prototype classifiers largely relies on the prototype learning algorithm. The minimum classification error (MCE) method and the soft nearest prototype classifier (SNPC) method are two important algorithms using misclassification loss. This paper proposes a new prototype learning algorithm based on the conditional log-likelihood loss (CLL), which is based on the discriminative model called log-likelihood of margin (LOGM). A regularization term is added to avoid over-fitting in training as well as to maximize the hypothesis margin. The CLL in the LOGM algorithm is a convex function of margin, and so, shows better convergence than the MCE. In addition, we show the effects of distance metric learning with both prototype-dependent weighting and prototype-independent weighting. Our empirical study on the benchmark datasets demonstrates that the LOGM algorithm yields higher classification accuracies than the MCE, generalized learning vector quantization (GLVQ), soft nearest prototype classifier (SNPC) and the robust soft learning vector quantization (RSLVQ), and moreover, the LOGM with prototype-dependent weighting achieves comparable accuracies to the support vector machine (SVM) classifier.
1. Introduction
Nearest neighbor classification [1] is a simple and appealing non-parametric method for pattern classification. It does not need any processing of the training data, but involves heavy burden of storage space and computation in classification. Prototype learning is to represent the training data with a set of points in feature space, called prototypes. A test point x is then classified to the class of the closest prototype. To reduce the number of prototypes, many prototype selection and prototype learning methods have been proposed [2–7]. By selecting or synthesizing prototypes that better represent the class distributions or decision boundaries, these methods are also effective to improve the classification accuracy. The nearest neighbor classifiers with reduced prototypes are also called nearest prototype classifiers. They have been widely used in applications such as character recognition [7], text categorization [8], classification of mass spectrometry data [9], and so on.
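As a concrete illustration of the classification rule described above, the sketch below (with illustrative names, not code from the paper) assigns a test point to the class of its nearest prototype under squared Euclidean distance.

```python
import numpy as np

def nearest_prototype_classify(x, prototypes, proto_labels):
    """Classify x to the class of its closest prototype (squared Euclidean distance)."""
    d2 = np.sum((prototypes - x) ** 2, axis=1)   # distance of x to every prototype
    return proto_labels[np.argmin(d2)]           # label of the nearest prototype

# Example: two prototypes, one per class; the test point is closer to class 1.
protos = np.array([[0.0, 0.0], [2.0, 2.0]])
labels = np.array([0, 1])
print(nearest_prototype_classify(np.array([1.8, 1.9]), protos, labels))  # -> 1
```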
Learning vector quantization (LVQ) [10] is a well known prototype learning algorithm which offers intuitive and simple, yet powerful learning capacity in supervised learning. Kohonen proposed a number of improved versions of LVQ such as LVQ2, LVQ2.1, and LVQ3 [3]. Crammer et al. [11] show that LVQ falls in a family of maximal margin learning algorithms providing a rigorous upper bound of generalization error. Although the LVQ algorithm yields superior classification performance, it does not guarantee convergence in training [4]. The performance of LVQ depends on several factors: the initialization of prototypes, the distance metric, and the selection of informative patterns, etc.
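For reference, the classic LVQ1 update rule attracts the winning (nearest) prototype when its class matches the training sample and repels it otherwise. The sketch below is this textbook rule under an assumed learning rate `lr`, not one of the improved variants (LVQ2, LVQ2.1, LVQ3) mentioned above.

```python
import numpy as np

def lvq1_update(x, y, prototypes, proto_labels, lr=0.05):
    """One online LVQ1 step: move the winning prototype toward x if its class
    matches y, away from x otherwise. Updates `prototypes` in place."""
    d2 = np.sum((prototypes - x) ** 2, axis=1)
    k = np.argmin(d2)                              # index of the winning prototype
    sign = 1.0 if proto_labels[k] == y else -1.0   # attract same class, repel others
    prototypes[k] += sign * lr * (x - prototypes[k])
    return prototypes
```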
To improve the training and generalization performance of LVQ, many extensions or variants have been proposed. In [12], an initialization insensitive LVQ algorithm uses a harmonic average distance instead of the usual nearest-neighbor distance. Sato and Yamada proposed a generalized LVQ (GLVQ) algorithm [4], where prototypes are updated based on a continuous and differentiable loss function. The generalized relevance LVQ (GRLVQ) algorithm [13] introduces the adaptive weighted metric to extend the GLVQ, by adding a prototype-independent weight to each input dimension indicating its relevance. Similarly, the algorithm of Paredes and Vidal [14] learns prototypes and distance metric simultaneously using a misclassification loss function. In [15], the combination of pattern selection and weighted distance norm is explored. It selects an update set composed of points that are considered to be at the risk of being captured by a prototype of a different class.
Many prototype learning algorithms based on loss minimization have been shown to give higher classification accuracies than LVQ [7]. These algorithms include the minimum classification error (MCE) method [16], the generalized LVQ (GLVQ) [4], the maximum class probability (MAXP) method [7], the soft nearest prototype classifier (SNPC) [17] and the robust soft learning vector quantization (RSLVQ) [18], etc. The MCE and GLVQ methods minimize margin-based loss functions, while the MAXP and SNPC formulate multiple prototypes in a Gaussian mixture framework and aim to minimize the Bayesian classification error on training data. Similarly, the RSLVQ optimizes the conditional log-likelihood loss (CLL) within a Gaussian mixture framework. The misclassification loss of [14] is similar to those of MCE and GLVQ. The margin-based algorithms, including LVQ, MCE and GLVQ, adjust only two prototypes on a training sample. The training complexity of the MAXP, SNPC and the RSLVQ method can be similarly decreased by pruning prototype updating according to the proximity of prototypes to the input pattern or using the windowing rule like the LVQ2.1.
In this paper, we investigate the loss functions of prototype learning algorithms and aim to improve the classification performance using a new form of loss function. The misclassification loss functions of MCE, GLVQ and the one in [14], based on the nearest prototype from the genuine class (correct class) of input pattern and the one from incorrect classes, are smoothed functions of the traditional 0–1 loss, while the direct minimization of the 0–1 loss is computationally intractable [19]. The MAXP algorithm [7], the SNPC method [17] and the RSLVQ [18] approximate the misclassification rate in the framework of Gaussian mixture. We will discuss the relationships between these algorithms under certain conditions.
We propose a new prototype learning algorithm by minimizing a conditional log-likelihood loss (CLL), called log-likelihood of margin (LOGM). The convexity of margin-based loss in LOGM ensures that there exists a unique maximum margin. We also add a regularization term to the loss function for avoiding [[over-fitting in training]] as well as maximizing the hypothesis margin. The proposed method can be easily extended to a weighted distance metric instead of the Euclidean distance to incorporate the relevance of features. We have extended the LOGM algorithm for prototype learning with both prototype-dependent weighting and prototype-independent weighting. Our empirical study on a large suite of benchmark datasets demonstrates that the LOGM is superior to the MCE, GLVQ, SNPC and RSLVQ methods, and moreover, the LOGM with prototype-dependent weighting achieves comparable accuracies with the support vector machine (SVM) classifier.
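The weighted-metric extension mentioned above replaces the Euclidean distance with a per-dimension weighted distance, where the weights are either shared across all prototypes (prototype-independent) or attached to each prototype (prototype-dependent). The sketch below only contrasts the two distance computations under assumed weight arrays `w` and `W`; the paper's weight-learning updates are not reproduced here.

```python
import numpy as np

def weighted_dist_independent(x, prototypes, w):
    """Prototype-independent weighting: one weight vector w of shape (d,),
    shared by all prototypes."""
    return np.sum(w * (prototypes - x) ** 2, axis=1)

def weighted_dist_dependent(x, prototypes, W):
    """Prototype-dependent weighting: one weight vector per prototype,
    W has shape (m, d)."""
    return np.sum(W * (prototypes - x) ** 2, axis=1)
```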
The rest of this paper is organized as follows: Section 2 gives the formulation of prototype learning algorithms; Section 3 briefly reviews the MCE and SNPC methods; Section 4 presents a prototype learning algorithm (LOGM) based on the margin-based conditional log-likelihood loss and extends the LOGM to prototype learning with weighted distance metric; Section 5 discusses the relationships between our algorithm and previous ones and their convergence; Section 6 presents our experimental results and the last section concludes the paper. This paper is an extension to our previous conference paper [20] by adding the discussions of relationships with other methods, the analysis of learning convergence and generalization, and the extension to prototype learning with weighted distance metric. Throughout the paper, we use many acronyms, which are listed in Table 1 for readers’ convenience.
Table 1 - List of acronyms

Acronym | Full name
---|---
CLL | Conditional Log-likelihood Loss
GLVQ | Generalized Learning Vector Quantization
GRLVQ | Generalized Relevance LVQ
LOGM | LOG-likelihood of Margin
LVQ | Learning Vector Quantization
MAXP | MAXimum class Probability
MAXP1 | A variant of MAXP
MCE | Minimum Classification Error
RBF | Radial Basis Function
RSLVQ | Robust Soft Learning Vector Quantization
RSLVQ1 | A variant of RSLVQ
SNPC | Soft Nearest Prototype Classifier
SVM | Support Vector Machine
References
- 1. Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification (2nd Edition), Wiley-Interscience, 2000
- 2. Chin-Liang Chang, Finding Prototypes For Nearest Neighbor Classifiers, IEEE Transactions on Computers, v.23 n.11, p.1179-1184, November 1974 doi:10.1109/T-C.1974.223827
- 3. Kohonen, T., Improved Versions of Learning Vector Quantization. Neural Networks. V1 I17-21. 545-550.
- 4. A. Sato, K. Yamada, Generalized Learning Vector Quantization, in: Advances in Neural Information Processing Systems, 1995, Pp. 423-429.
- 5. J. C. Bezdek, T. R. Reichherzer, G. S. Lim, Y. Attikiouzel, Multiple-prototype Classifier Design, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, v.28 n.1, p.67-79, February 1998 doi:10.1109/5326.661091
- 6. L. I. Kuncheva, J. C. Bezdek, Nearest Prototype Classification: Clustering, Genetic Algorithms, Or Random Search?, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, v.28 n.1, p.160-164, February 1998 doi:10.1109/5326.661099
- 7. Liu, C.-L. and Nakagawa, M., Evaluation of Prototype Learning Algorithms for Nearest-neighbor Classifier in Application to Handwritten Character Recognition. Pattern Recognition. V34 I3. 601-615.
- 8. Martín-Valdivia, M.T., Vega, M.G. and López, L.A.U., LVQ for Text Categorization Using a Multilingual Linguistic Resource. Neurocomputing. V55 I3-4. 665-679.
- 9. P. Schneider, M. Biehl, B. Hammer, Advanced Metric Adaptation in Generalized LVQ for Classification of Mass Spectrometry Data, in: Proceeding of the Workshop on Self Organizing Maps, 2007.
- 10. T. Kohonen, J. Hynninen, J. Kangas, J. Laaksonen, K. Torkkola, LVQ PAK: The Learning Vector Quantization Program Package, Technical Report, Helsinki University of Technology, 1995.
- 11. K. Crammer, R. Gilad-Bachrach, A. Navot, N. Tishby, Margin Analysis of the LVQ Algorithm, in: Advances in Neural Information Processing Systems, 2002, Pp. 462-469.
- 12. A. K. Qin, P. N. Suganthan, Rapid and Brief Communication: Initialization Insensitive LVQ Algorithm based on Cost-function Adaptation, Pattern Recognition, v.38 n.5, p.773-776, May, 2005 doi:10.1016/j.patcog.2004.11.011
- 13. Barbara Hammer, Thomas Villmann, Generalized Relevance Learning Vector Quantization, Neural Networks, v.15 n.8-9, p.1059-1068, October 2002 doi:10.1016/S0893-6080(02)00079-5
- 14. Roberto Paredes, Enrique Vidal, Learning Prototypes and Distances: A Prototype Reduction Technique based on Nearest Neighbor Error Minimization, Pattern Recognition, v.39 n.2, p.180-188, February, 2006 doi:10.1016/j.patcog.2005.06.001
- 15. Carlos E. Pedreira, Learning Vector Quantization with Training Data Selection, IEEE Transactions on Pattern Analysis and Machine Intelligence, v.28 n.1, p.157-162, January 2006 doi:10.1109/TPAMI.2006.14
- 16. B.-H. Juang, S. Katagiri, Discriminative Learning for Minimum Error Classification [pattern Recognition], IEEE Transactions on Signal Processing, v.40 n.12, p.3043-3054, December 1992 doi:10.1109/78.175747
- 17. S. Seo, M. Bode, K. Obermayer, Soft Nearest Prototype Classification, IEEE Transactions on Neural Networks, v.14 n.2, p.390-398, March 2003 doi:10.1109/TNn.2003.809407
- 18. Sambu Seo, Klaus Obermayer, Soft Learning Vector Quantization, Neural Computation, v.15 n.7, p.1589-1604, July 2003 doi:10.1162/089976603321891819
- 19. Bartlett, P.L., Jordan, M.I. and McAuliffe, J.D., Convexity, Classification, and Risk Bounds. J. Am. Statist. Assoc. V101 I473. 138-156.
- 20. X. Jin, C.-L. Liu, X. Hou, Prototype Learning by Margin-based Conditional Log-likelihood Loss, In: Proceedings of the 19th ICPR, Tampa, USA, 2008.
- 21. Christopher M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag New York, Inc., Secaucus, NJ, 2006
- 22. Hermann Ney, On the Probabilistic Interpretation of Neural Network Classifiers and Discriminative Training Criteria, IEEE Transactions on Pattern Analysis and Machine Intelligence, v.17 n.2, p.107-119, February 1995 doi:10.1109/34.368176
- 23. Daniel Grossman, Pedro Domingos, Learning Bayesian Network Classifiers by Maximizing Conditional Likelihood, Proceedings of the Twenty-first International Conference on Machine Learning, p.46, July 04-08, 2004, Banff, Alberta, Canada doi:10.1145/1015330.1015339
- 24. Cheng-Lin Liu, H. Sako, H. Fujisawa, Discriminative Learning Quadratic Discriminant Function for Handwriting Recognition, IEEE Transactions on Neural Networks, v.15 n.2, p.430-444, March 2004 doi:10.1109/TNn.2004.824263
- 25. Baras, J.S. and LaVigna, A., Convergence of Kohonen's Learning Vector Quantization. Neural Networks. V3 I17-21. 17-20.
- 26. E. B. Kosmatopoulos, M. A. Christodoulou, Convergence Properties of a Class of Learning Vector Quantization Algorithms, IEEE Transactions on Image Processing, v.5 n.2, p.361-368, February 1996 doi:10.1109/83.480771
- 27. Robbins, H. and Monro, S., A Stochastic Approximation Method. Ann. Math. Statist. V22. 400-407.
- 28. Léon Bottou, On-line Learning and Stochastic Approximations, On-line Learning in Neural Networks, Cambridge University Press, New York, NY, 1999
- 29. Vladimir N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag New York, Inc., New York, NY, 1995
- 30. Thorsten Joachims, Making Large-scale Support Vector Machine Learning Practical, Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999
- 31. C. Blake, C. Merz, UCI Machine Learning Repository, University of California Irvine {http://www.ics.uci.edu/mlearn/MLRepository.html}, 1998.
- 32. K. Lang, Newsweeder: Learning to Filter Netnews, in: Proceedings of the 12th ICML, 1995, Pp. 331-339.
- 33. Porter, M., An Algorithm for Suffix Stripping. Program. V14 I3. 130-137.
- 34. Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys (CSUR), v.34 n.1, p.1-47, March 2002 doi:10.1145/505282.505283
- 35. Yiming Yang, Jan O. Pedersen, A Comparative Study on Feature Selection in Text Categorization, Proceedings of the Fourteenth International Conference on Machine Learning, p.412-420, July 08-12, 1997
- 36. Nir Friedman, Dan Geiger, Moises Goldszmidt, Bayesian Network Classifiers, Machine Learning, v.29 n.2-3, p.131-163, Nov./Dec. 1997 doi:10.1023/A:1007465528199
- 37. Wang, W., Xu, Z., Lu, W. and Zhang, X., Determination of the Spread Parameter in the Gaussian Kernel for Classification and Regression. Neurocomputing. V55 I3-4. 643-663.
- 38. R. Collobert, S. Bengio, J. Mariéthoz, Torch: A Modular Machine Learning Software Library, Technical Report IDIAP-RR 02-46, IDIAP, 2002.
- 39. Janez Demšar, Statistical Comparisons of Classifiers over Multiple Data Sets, The Journal of Machine Learning Research, 7, p.1-30, 2006
Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year
---|---|---|---|---|---|---|---|---|---
Xiao-Bo Jin, Cheng-Lin Liu, Xinwen Hou | 43(7) | 2010 | Regularized Margin-based Conditional Log-likelihood Loss for Prototype Learning | | Pattern Recognition Journal | | 10.1016/j.patcog.2010.01.013 | | 2010