Zero-One Loss Function
A Zero-One Loss Function is a Loss Function that is used in classification tasks: it assigns a loss of 0 to a correct classification and a loss of 1 to an incorrect one (see the sketch after the list below).
- AKA: 0-1 Loss.
- Context:
- It can be a Bounded Loss Function.
- ...
- Example(s):
- Counter-Example(s):
- See: Optimization Problem, Classification System, Bayesian Classifier, Decision Theory, Indicator Notation.
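The following is a minimal sketch (in Python, with hypothetical function and variable names) of how the zero-one loss of a single prediction and the corresponding misclassification rate over a dataset might be computed; it is an illustration of the definition above, not a reference implementation.

```python
def zero_one_loss(y_pred, y_true):
    """Zero-one loss for a single prediction: 0 if correct, 1 if incorrect."""
    return 0 if y_pred == y_true else 1

def misclassification_rate(y_preds, y_trues):
    """Average zero-one loss over a set of predictions (the empirical risk)."""
    losses = [zero_one_loss(p, t) for p, t in zip(y_preds, y_trues)]
    return sum(losses) / len(losses)

# Example usage with hypothetical labels:
# misclassification_rate([1, 0, 1, 1], [1, 1, 1, 0])  -> 0.5
```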
References
2020
- (Hasan & Pal, 2020) ⇒ Kamrul Hasan and Christopher J. Pal (2020). “A New Smooth Approximation to the Zero One Loss with a Probabilistic Interpretation". In: ACM Transactions on Knowledge Discovery from Data. DOI:10.1145/3365672.
- QUOTE: We examine a new form of smooth approximation to the zero one loss in which learning is performed using a reformulation of the widely used logistic function. Our approach is based on using the posterior mean of a novel generalized Beta-Bernoulli formulation. This leads to a generalized logistic function that approximates the zero one loss, but retains a probabilistic formulation conferring a number of useful properties. The approach is easily generalized to kernel logistic regression and easily integrated into methods for structured prediction. We present experiments in which we learn such models using an optimization method consisting of a combination of gradient descent and coordinate descent using localized grid search so as to escape from local minima. Our experiments indicate that optimization quality is improved when learning meta-parameters are themselves optimized using a validation set.
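As an illustration of the general idea of a smooth surrogate for the zero-one loss, the sketch below uses a generic steep logistic curve with a hypothetical steepness parameter `gamma`; it is not the paper's generalized Beta-Bernoulli formulation, only an assumed simplification of the same idea.

```python
import math

def smooth_zero_one_loss(margin, gamma=10.0):
    """Smooth surrogate for the zero-one loss on a classification margin.

    margin = y * f(x) with y in {-1, +1}; a correct, confident prediction has a
    large positive margin. As gamma grows, the logistic curve approaches the
    0-1 step: loss -> 1 for margin < 0 and loss -> 0 for margin > 0.
    """
    return 1.0 / (1.0 + math.exp(gamma * margin))

# Example: a correct prediction with margin 0.5 gives a loss near 0,
# while an incorrect prediction with margin -0.5 gives a loss near 1.
# smooth_zero_one_loss(0.5)   -> ~0.0067
# smooth_zero_one_loss(-0.5)  -> ~0.9933
```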
2017
- (Sammut & Webb, 2017) ⇒ Claude Sammut, and Geoffrey I. Webb. (2017). "Zero-One Loss". In: (Sammut & Webb, 2017).
- QUOTE: Zero-one loss is a common loss function used with classification learning. It assigns 0 to loss for a correct classification and 1 for an incorrect classification.
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/loss_function#0-1_loss_function Retrieved:2015-1-7.
- In statistics and decision theory, a frequently used loss function is the 0-1 loss function
: [math]\displaystyle{ L(\hat{y}, y) = I(\hat{y} \ne y), \, }[/math]
where [math]\displaystyle{ I }[/math] is the indicator notation.
2009
- (Gentle, 2009) ⇒ James E. Gentle. (2009). “Computational Statistics." Springer. ISBN:978-0-387-98143-7
- QUOTE: Any strictly convex loss function over an unbounded interval is unbounded. It is not always realistic to use an unbounded loss function. A common bounded loss function is the 0-1 loss function, which may be : [math]\displaystyle{ L_{0\text{-}1}(\theta, a) = 0 \ \text{if}\ |g(\theta) - a| \le \alpha(n) }[/math] : [math]\displaystyle{ L_{0\text{-}1}(\theta, a) = 1 \ \text{otherwise}. }[/math]
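A minimal sketch (with hypothetical names) of the bounded zero-one loss for point estimation defined above, where the loss is 0 whenever the action a falls within a tolerance alpha(n) of the estimand g(theta):

```python
def bounded_zero_one_loss(g_theta, a, alpha_n):
    """Zero-one loss for estimation: 0 if |g(theta) - a| <= alpha_n, else 1."""
    return 0 if abs(g_theta - a) <= alpha_n else 1

# Example usage with hypothetical values:
# bounded_zero_one_loss(g_theta=2.0, a=2.05, alpha_n=0.1)  -> 0
# bounded_zero_one_loss(g_theta=2.0, a=2.50, alpha_n=0.1)  -> 1
```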
1997
- (Domingos & Pazzani, 1997) ⇒ Pedro M. Domingos and Michael J. Pazzani (1997). "On the Optimality of the Simple Bayesian Classifier under Zero-One Loss". Machine Learning 29, 103–130 (1997). DOI:10.1023/A:1007413511361.
- QUOTE: This article shows that, although the Bayesian classifier's probability estimates are only optimal under quadratic loss if the independence assumption holds, the classifier itself can be optimal under zero-one loss (misclassification rate) even when this assumption is violated by a wide margin. The region of quadratic-loss optimality of the Bayesian classifier is in fact a second-order infinitesimal fraction of the region of zero-one optimality. This implies that the Bayesian classifier has a much greater range of applicability than previously thought. For example, in this article it is shown to be optimal for learning conjunctions and disjunctions, even though they violate the independence assumption. Further, studies in artificial domains show that it will often outperform more powerful classifiers for common training set sizes and numbers of attributes, even if its bias is a priori much less appropriate to the domain. This article's results also imply that detecting attribute dependence is not necessarily the best way to extend the Bayesian classifier, and this is also verified empirically.
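A minimal illustration (hypothetical numbers, not taken from the article) of the distinction the article draws: a probability estimate can be badly calibrated, and hence poor under quadratic loss, while the class decision it implies is still the same as the Bayes-optimal one, so the zero-one loss is unaffected.

```python
def quadratic_loss(p_hat, p_true):
    """Squared error between the estimated and true posterior probability of class 1."""
    return (p_hat - p_true) ** 2

def zero_one_agreement(p_hat, p_true):
    """True if thresholding both probabilities at 0.5 yields the same class decision."""
    return (p_hat >= 0.5) == (p_true >= 0.5)

# Hypothetical case: true posterior P(y=1 | x) = 0.6, but a violated independence
# assumption inflates the estimate to 0.99.
# quadratic_loss(0.99, 0.6)      -> ~0.152  (poor probability estimate)
# zero_one_agreement(0.99, 0.6)  -> True    (same class decision, so zero-one loss is unchanged)
```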