2004 InDefenseofOneVsAllClassification


Subject Headings: Multiclass Classification Algorithm, One-vs-Rest Multiclass Supervised Classification Algorithm, Regularized Supervised Classification Algorithm.

Notes

Cited By

Quotes

Author Keywords

Multiclass Classification, Regularization

Abstract

We consider the problem of multiclass classification. Our main thesis is that a simple "one-vs-all" scheme is as accurate as any other approach, assuming that the underlying binary classifiers are well-tuned regularized classifiers such as support vector machines. This thesis is interesting in that it disagrees with a large body of recent published work on multiclass classification. We support our position by means of a critical review of the existing literature, a substantial collection of carefully controlled experimental work, and theoretical arguments.

1. Introduction

We consider the problem of multiclass classification. A training set consisting of data points belonging to [math]\displaystyle{ N }[/math] different classes is given, and the goal is to construct a function which, given a new data point, will correctly predict the class to which the new point belongs. [1]

Over the last decade, there has been great interest in classifiers that use regularization to control the capacity of the function spaces they operate in. These classifiers - the best-known example of which is the support vector machine (SVM) (Boser et al., 1992) - have proved extremely successful at binary classification tasks (Vapnik, 1998, Evgeniou et al., 2000, Rifkin, 2002). It therefore seems interesting to consider whether the advantages of regularization approaches for binary classifiers carry over to the multiclass situation. One of the simplest multiclass classification schemes built on top of real-valued binary classifiers is to train [math]\displaystyle{ N }[/math] different binary classifiers, each one trained to distinguish the examples in a single class from the examples in all remaining classes. When it is desired to classify a new example, the [math]\displaystyle{ N }[/math] classifiers are run, and the classifier which outputs the largest (most positive) value is chosen. This scheme will be referred to as the “one-vs-all” or OVA scheme throughout this paper. The one-vs-all scheme is conceptually simple, and has been independently discovered numerous times by different researchers.
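The following is a minimal sketch of the OVA scheme just described, assuming scikit-learn's SVC as the underlying binary classifier; the kernel and hyperparameter values are illustrative placeholders, not the settings used in the paper's experiments:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.svm import SVC  # assumed binary learner; any real-valued classifier would do

def train_ova(X, y, C=1.0, gamma="scale"):
    """Train one binary SVM per class: that class (+1) versus all other classes (-1)."""
    machines = {}
    for c in np.unique(y):
        binary_labels = np.where(y == c, 1, -1)
        machines[c] = SVC(kernel="rbf", C=C, gamma=gamma).fit(X, binary_labels)
    return machines

def predict_ova(machines, X):
    """Run all N classifiers and return the class whose output is largest (most positive)."""
    classes = list(machines)
    scores = np.column_stack([machines[c].decision_function(X) for c in classes])
    return np.array(classes)[scores.argmax(axis=1)]
</syntaxhighlight>

In practice, "well-tuned" means that [math]\displaystyle{ C }[/math] and the kernel parameters of each binary SVM are chosen by cross-validation rather than left at the defaults shown here.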

One might argue that OVA is the first thing thought of when asked to come up with an approach for combining binary classifiers to solve multiclass problems. Although it is simple and obvious, the primary thesis of this paper is that the OVA scheme is extremely powerful, producing results that are often at least as accurate as other methods. This thesis seems quite innocuous and hardly worth writing a paper about, until one realizes that this idea is in opposition to a large body of recent literature on multiclass classification, in which a number of more complicated methods have been developed and their superiority over OVA claimed. These methods can be roughly divided into two different approaches - the "single machine" approaches, which attempt to construct a multiclass classifier by solving a single optimization problem (Weston and Watkins, 1998, Lee et al., 2001a, b, Crammer and Singer, 2001), and the "error correcting" approaches (Dietterich and Bakiri, 1995, Allwein et al., 2000, Crammer and Singer, 2002, Fürnkranz, 2002, Hsu and Lin, 2002), which use ideas from error correcting coding theory to choose a collection of binary classifiers to train and a method for combining the binary classifiers.

A substantial portion of this paper is devoted to a detailed review and discussion of this literature. What we find is that although a wide array of more sophisticated methods for multiclass classification exist, experimental evidence of the superiority of these methods over a simple OVA scheme is either lacking or improperly controlled or measured. One scheme that is particularly worthy of attention is the "all-pairs", or AVA ("all-vs-all"), scheme. In this approach, [math]\displaystyle{ \binom{N}{2} = N(N-1)/2 }[/math] binary classifiers are trained; each classifier separates a pair of classes. This scheme, like the OVA scheme, has a simple conceptual justification, and can be implemented to train faster and test as quickly as the OVA scheme. Several authors have reported that the AVA scheme offers better performance than the OVA scheme (Allwein et al., 2000, Fürnkranz, 2002, Hsu and Lin, 2002). Our results disagree with the ones presented in all three of these papers, essentially because we feel their experiments were not as carefully controlled and reported as ours.
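For comparison, here is an analogous sketch of the AVA scheme, again assuming scikit-learn's SVC as the binary learner; the simple majority-vote combination used below is one common choice, not necessarily the exact rule used in the experiments cited above:

<syntaxhighlight lang="python">
from itertools import combinations

import numpy as np
from sklearn.svm import SVC

def train_ava(X, y, C=1.0, gamma="scale"):
    """Train one binary SVM per pair of classes, using only the examples from that pair."""
    machines = {}
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        labels = np.where(y[mask] == a, 1, -1)
        machines[(a, b)] = SVC(kernel="rbf", C=C, gamma=gamma).fit(X[mask], labels)
    return machines

def predict_ava(machines, X):
    """Each pairwise classifier votes for one of its two classes; the most-voted class wins."""
    classes = sorted({c for pair in machines for c in pair})
    index = {c: i for i, c in enumerate(classes)}
    votes = np.zeros((len(X), len(classes)), dtype=int)
    for (a, b), clf in machines.items():
        pred = clf.predict(X)              # +1 means class a, -1 means class b
        votes[pred == 1, index[a]] += 1
        votes[pred == -1, index[b]] += 1
    return np.array(classes)[votes.argmax(axis=1)]
</syntaxhighlight>

Although AVA trains [math]\displaystyle{ \binom{N}{2} }[/math] machines rather than [math]\displaystyle{ N }[/math], each is trained on only the examples of two classes, which is why its total training time can be smaller than OVA's.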

For an experiment to be carefully controlled means that a reasonable effort was made to find correct settings for the hyperparameters of the algorithm (in a way that does not involve mining the test set, obviously) and that the best available binary classifiers are used. This point is critical; it is easy to show (and many authors have) that if relatively weak binary learners are used, then a wide variety of clever methods for combining them will exploit the independence in the error rates of these weak classifiers to improve the overall result. In contrast, we demonstrate empirically in a substantial collection of well-controlled experiments that when well-tuned SVMs (or regularized least squares classifiers) are used as the binary learners, there is little to no independence in the errors of the binary classifiers, and therefore nothing to be gained from sophisticated methods of combination. The crucial question for the practitioner then becomes whether sophisticated methods of combining weak classifiers can achieve stronger results than a simple method of combining strong classifiers. The empirical evidence supports the notion that they cannot.

For an experiment to be carefully reported indicates that the results are presented in a way that is easy to understand, and that reasonable conclusions are drawn from them. This is obviously a subjective notion, but we wish to point out a specific area which we feel is problematic: the notion that a lower absolute error rate is strongly indicative of the superiority of a classifier when these absolute error rates are very close. In other words, we feel it is not appropriate simply to present results where the best-performing classifier has its score given in bold on each experiment, and the classifier with the most bold scores is declared the strongest, because this ignores the very real possibility (see for example Hsu and Lin, 2002) that on nearly all the experiments, the actual experimental differences are tiny. Therefore, it is worthwhile to assess statistically the relative performance of classification systems.

One possible test for comparing two schemes is McNemar's test (McNemar, 1947, Everitt, 1977) (see Appendix A). A difficulty with McNemar's test is that it is insensitive to the number of points that the two schemes agree on; directly related to this, McNemar's test simply tells whether or not two scores are statistically significantly different (according to the assumptions inherent in the test), but gives no indication of how different they are. For this reason, we advocate and use a simple bootstrap method for computing confidence intervals of the difference in performance between a pair of classifiers. This method is described in Appendix A.2.
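As an illustration of the general idea (not a reproduction of the exact procedure in Appendix A.2), a paired bootstrap over the test examples might look as follows; the number of resamples and the confidence level are illustrative:

<syntaxhighlight lang="python">
import numpy as np

def bootstrap_error_difference(y_true, pred_a, pred_b,
                               n_resamples=10000, alpha=0.05, seed=0):
    """Bootstrap a confidence interval for the difference in test error between
    two classifiers A and B evaluated on the same test set."""
    rng = np.random.default_rng(seed)
    errors_a = (pred_a != y_true).astype(float)
    errors_b = (pred_b != y_true).astype(float)
    n = len(y_true)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)   # resample test points with replacement
        diffs[i] = errors_a[idx].mean() - errors_b[idx].mean()
    lower, upper = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

# If the resulting interval excludes zero, the observed difference in error rates
# is unlikely to be an artifact of the particular test sample.
</syntaxhighlight>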

Additionally, in order to allow for comparisons by different researchers, we feel that it is crucial to present the actual error rates of various classification schemes, rather than (or in addition to) the relative differences in error between two schemes (see, for example, Dietterich and Bakiri, 1995, Crammer and Singer, 2001, 2002); this is particularly important given that different researchers are often unable to produce identical results using identical methodologies (see Appendix B).

In general, we are not stating that an OVA scheme will perform substantially better than other approaches. Instead, we are stating that it will perform just as well as these approaches, and therefore it is often to be preferred due to its computational and conceptual simplicity.

2. Regularized Kernel Classifiers

A broad class of classification (and regression) algorithms can be derived from the general approach of Tikhonov regularization. In our case, we consider Tikhonov regularization in a reproducing kernel Hilbert space: [math]\displaystyle{ \min_{f \in \mathcal{H}} \sum_{i=1}^{\ell} V(f(\mathbf{x}_i), y_i) + \lambda ||f||^2_K. }[/math]

Here, [math]\displaystyle{ \ell }[/math] is the size of the training set [math]\displaystyle{ S \equiv \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_\ell, y_\ell)\} }[/math]. [math]\displaystyle{ V(f(\mathbf{x}), y) }[/math] represents the cost we incur when the algorithm sees [math]\displaystyle{ \mathbf{x} }[/math], predicts [math]\displaystyle{ f(\mathbf{x}) }[/math], and the actual value is [math]\displaystyle{ y }[/math]. The parameter [math]\displaystyle{ \lambda }[/math] is the "regularization parameter", which controls the tradeoff between the two conflicting goals of minimizing the training error and ensuring that [math]\displaystyle{ f }[/math] is smooth. A detailed discussion of RKHSs is beyond the scope of this paper (for details, see Evgeniou et al., 2000, and the references therein). For our purposes, there are essentially only two key facts about RKHSs. The first is the "Representer Theorem" (Wahba, 1990), which states that under very general conditions on the loss function [math]\displaystyle{ V }[/math], the solution to a Tikhonov minimization problem can be written as a sum of kernel products on the training set: [math]\displaystyle{ f(\mathbf{x}) = \sum_{i=1}^{\ell} c_i K(\mathbf{x}, \mathbf{x}_i). \qquad (1) }[/math]

The goal of a Tikhonov minimization procedure is to find the [math]\displaystyle{ c_i }[/math]. The other fact is that for functions represented in this form, [math]\displaystyle{ ||f||^2_K = \mathbf{c}^T K \mathbf{c} }[/math].

Given these facts, a range of different algorithms can be derived by choosing the function [math]\displaystyle{ V }[/math].

If we choose [math]\displaystyle{ V(f(\mathbf{x}), y) = (f(\mathbf{x}) - y)^2 }[/math], the so-called "square loss", the Tikhonov minimization problem becomes regularized least squares classification (RLSC): [2] [math]\displaystyle{ (K + \lambda \ell I)\mathbf{c} = \mathbf{y}. }[/math]
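A minimal sketch of RLSC with a Gaussian kernel, following the linear system above (the kernel width and [math]\displaystyle{ \lambda }[/math] are placeholders to be tuned):

<syntaxhighlight lang="python">
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def train_rlsc(X, y, lam=1e-3, sigma=1.0):
    """Solve (K + lambda * l * I) c = y for the expansion coefficients c; y holds +1/-1 labels."""
    l = len(X)
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * l * np.eye(l), y)

def predict_rlsc(c, X_train, X_new, sigma=1.0):
    """Evaluate f(x) = sum_i c_i K(x, x_i); its sign (or the argmax over OVA machines) gives the class."""
    return rbf_kernel(X_new, X_train, sigma) @ c
</syntaxhighlight>

Since the matrix [math]\displaystyle{ K + \lambda \ell I }[/math] does not depend on the labels, in an OVA setting a single factorization of it can be reused to solve for the coefficient vectors of all [math]\displaystyle{ N }[/math] binary classifiers.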

Footnotes

  1. In our framework, each data point is required to belong to a single class. We distinguish this from the case when there are more than two classes, but a given example can be a member of more than one class simultaneously. In the latter case, if the labels are independent, the problem very naturally decomposes into [math]\displaystyle{ N }[/math] unlinked binary problems, where the [math]\displaystyle{ i }[/math]th binary learner simply learns to distinguish whether or not an example is in class i. If the labels are dependent, then how best to perform multiclass classification is an interesting research problem, but is beyond the scope of this paper.
  2. Regularized least-squares classification is not a new algorithm. The mathematics of finding a linear function that minimizes the square loss over a set of data was first derived by Gauss (1823). The idea of regularization is apparent in the work of Tikhonov and Arsenin (1977), who used least-squares regularization to restore well-posedness to ill-posed problems. Schoenberg's seminal article on smoothing splines (Schoenberg, 1964) also used regularization. These authors considered regression problems rather than classification, and did not use reproducing kernel Hilbert spaces as regularizers. In 1971, Kimeldorf and Wahba (1971) considered square-loss regularization using the norm in a reproducing kernel Hilbert space as a stabilizer. Only regression problems were considered in this work. In 1989, Girosi and Poggio considered regularized classification and regression problems with the square loss (Girosi and Poggio, 1989, Poggio and Girosi, 1990). They used pseudo-differential operators as their stabilizers; these are essentially equivalent to using the norm in an RKHS.

References

(Rifkin & Klautau, 2004) ⇒ Ryan M. Rifkin, and Aldebaro Klautau. (2004). "In Defense of One-Vs-All Classification." In: The Journal of Machine Learning Research, 2004.