Area Under the Receiver-Operator Curve (AUC) Metric
An Area Under the Receiver-Operator Curve (AUC) Metric is a classifier performance measure equal to the area under an ROC curve.
- Context:
- It can be computed by sorting the test examples by their prediction scores, calculating the TPR and FPR at each score threshold, and integrating the resulting ROC curve with the trapezoidal rule.
- It can provide a single-number discrimination score that summarizes model performance over the full range of classification thresholds, which avoids the subjectivity of threshold selection.
- It can be applied to any predictive model with a scoring function.
- It can produce an AUC score in the range [0, 1], with a score of 0.5 for random predictions and 1 for perfect predictions.
- It can be used by an AUC-based Analysis Task (for both offline predictive model monitoring and online predictive model monitoring).
- Example(s):
- Counter-Example(s):
- See: Accuracy Measure, ROC Measure, Brier Score, Matthews Correlation Coefficient, Wilcoxon Signed-Rank Test, Gini Coefficient.
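The computation described in the context items above (sort by score, track TPR and FPR, integrate with trapezoids) can be sketched as follows. This is a minimal illustration and not a reference implementation: the function names `roc_points` and `trapezoid_auc` are hypothetical, labels are assumed to be 1 for positive and 0 for negative, and examples with tied scores are assumed to be consumed together so that a tie contributes a diagonal segment.

```python
def roc_points(labels, scores):
    """Return ROC vertices as (FPR, TPR) pairs, from (0, 0) to (1, 1)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    # Sort by descending score: walking the list lowers the threshold.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Assumption: all examples sharing a score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_auc(labels, scores):
    """Integrate the ROC curve with the trapezoidal rule."""
    points = roc_points(labels, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data (made up for illustration): 8 of the 9 positive/negative
# pairs are ordered correctly, so the area is 8/9.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(trapezoid_auc(labels, scores))
```

On this toy data the ROC curve is a staircase, so the trapezoids reduce to rectangles; the trapezoid form only matters when scores are tied.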
References
2018
- (Google ML Glossary, 2018) ⇒ (2018). AUC (Area under the ROC Curve). In: Machine Learning Glossary https://developers.google.com/machine-learning/glossary/ Retrieved: 2018-04-22.
- QUOTE: An evaluation metric that considers all possible classification thresholds.
The Area Under the ROC curve is the probability that a classifier will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.
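The probabilistic reading in this quote can be checked directly by enumerating every (positive, negative) pair and counting how often the positive example receives the higher score. The sketch below is illustrative only: `pairwise_auc` is a hypothetical name, labels are assumed to be 1 or 0, and counting a tied pair as half a win is an assumed convention (the one used by the Mann–Whitney U statistic), not something the glossary specifies.

```python
from itertools import product


def pairwise_auc(labels, scores):
    """Estimate P(score of random positive > score of random negative)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5  # assumption: a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(pairwise_auc(labels, scores))
```

This enumeration costs time proportional to the number of pairs, so it is useful as a check on small samples rather than as a way to score large test sets.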
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve Retrieved:2015-7-18.
- When using normalized units, the area under the curve (often referred to as simply the AUC, or AUROC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative').[1] This can be seen as follows: the area under the curve is given by (the integral boundaries are reversed as large T has a lower value on the x-axis):
[math]\displaystyle{ A = \int_{\infty}^{-\infty} y(T)\, x'(T) \, dT = \int_{\infty}^{-\infty} \mathrm{TPR}(T)\, \mathrm{FPR}'(T) \, dT = \int_{-\infty}^{\infty} \mathrm{TPR}(T)\, P_0(T) \, dT = \langle \mathrm{TPR} \rangle }[/math]
The angular brackets denote average from the distribution of negative samples.
It can further be shown that the AUC is closely related to the Mann–Whitney U,[2][3] which tests whether positives are ranked higher than negatives. It is also equivalent to the Wilcoxon test of ranks.
The AUC is related to the Gini coefficient ([math]\displaystyle{ G_1 }[/math]) by the formula [math]\displaystyle{ G_1 = 2\,\mathrm{AUC} - 1 }[/math], where:
[math]\displaystyle{ G_1 = 1 - \sum_{k=1}^n (X_{k} - X_{k-1}) (Y_k + Y_{k-1}) }[/math][4]
In this way, it is possible to calculate the AUC by using an average of a number of trapezoidal approximations.
It is also common to calculate the Area Under the ROC Convex Hull (ROC AUCH = ROCH AUC) as any point on the line segment between two prediction results can be achieved by randomly using one or other system with probabilities proportional to the relative length of the opposite component of the segment. Interestingly, it is also possible to invert concavities – just as in the figure the worse solution can be reflected to become a better solution; concavities can be reflected in any line segment, but this more extreme form of fusion is much more likely to overfit the data.[5]
The machine learning community most often uses the ROC AUC statistic for model comparison.
However, this practice has recently been questioned based upon new machine learning research that shows that the AUC is quite noisy as a classification measure[6] and has some other significant problems in model comparison.[7] [8] A reliable and valid AUC estimate can be interpreted as the probability that the classifier will assign a higher score to a randomly chosen positive example than to a randomly chosen negative example. However, the critical research suggests frequent failures in obtaining reliable and valid AUC estimates. Thus, the practical value of the AUC measure has been called into question, raising the possibility that the AUC may actually introduce more uncertainty into machine learning classification accuracy comparisons than resolution. Nonetheless, the coherence of AUC as a measure of aggregated classification performance has been vindicated, in terms of a uniform rate distribution,[9] and AUC has been linked to a number of other performance metrics such as the Brier score.[10] One recent explanation of the problem with ROC AUC is that reducing the ROC Curve to a single number ignores the fact that it is about the tradeoffs between the different systems or performance points plotted and not the performance of an individual system, as well as ignoring the possibility of concavity repair, so that related alternative measures such as Informedness[11] or DeltaP are recommended. These measures are essentially equivalent to the Gini for a single prediction point with DeltaP' = Informedness = 2AUC-1, whilst DeltaP = Markedness represents the dual (viz. predicting the prediction from the real class) and their geometric mean is the Matthews correlation coefficient.
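The Gini relation quoted above can be verified numerically on a small ROC curve. The sketch below is a worked check under stated assumptions, not part of the quoted source: the function names and the toy ROC vertices are hypothetical, and the quoted sum does not say which axis is X and which is Y. With X = FPR and Y = TPR the sum equals 2·AUC, so the identity G_1 = 2·AUC − 1 holds only when X is read as the TPR axis and Y as the FPR axis; the code uses that reading.

```python
def auc_from_roc(points):
    """Trapezoidal area under ROC vertices given as (FPR, TPR) pairs."""
    area = 0.0
    for (f0, t0), (f1, t1) in zip(points, points[1:]):
        area += (f1 - f0) * (t1 + t0) / 2.0
    return area


def gini_from_roc(points):
    """G_1 = 1 - sum_k (X_k - X_{k-1}) (Y_k + Y_{k-1}).

    Assumption: X is the TPR axis and Y is the FPR axis, so the sum is
    twice the area between the ROC curve and the TPR axis, i.e. 2 (1 - AUC).
    """
    total = 0.0
    for (f0, t0), (f1, t1) in zip(points, points[1:]):
        total += (t1 - t0) * (f1 + f0)
    return 1.0 - total


# Hypothetical ROC vertices, ordered from (0, 0) to (1, 1).
roc = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
auc = auc_from_roc(roc)    # 0.25 + 0.5 = 0.75
gini = gini_from_roc(roc)  # 1 - 0.5 = 0.5
print(auc, gini, 2 * auc - 1)
```

For these vertices the AUC is 0.75 and the Gini coefficient is 0.5, matching 2·AUC − 1.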
2014
- Data School. (2014). “ROC Curves and Area Under the Curve (AUC) Explained.”
2011
- (Sammut & Webb, 2011) ⇒ Claude Sammut (editor), and Geoffrey I. Webb (editor). (2011). “Area Under Curve.” In: (Sammut & Webb, 2011) p.41
- QUOTE: The area under curve (AUC) statistic is an empirical measure of classification performance based on the area under an ROC curve. It evaluates the performance of a scoring classifier on a test set, but ignores the magnitude of the scores and only takes their rank order into account. AUC is expressed on a scale of 0 to 1, where 0 means that all negatives are ranked before all positives, and 1 means that all positives are ranked before all negatives. See ROC Analysis.
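Because only the rank order of the scores matters, the statistic can be computed from ranks alone, and any strictly increasing transformation of the scores leaves it unchanged. The sketch below illustrates this with the rank-sum (Mann–Whitney U) form of the AUC. It is a minimal example under assumptions: `average_ranks` and `rank_auc` are hypothetical names, labels are assumed to be 1 or 0, and tied scores are assumed to share their mean rank.

```python
import math


def average_ranks(values):
    """1-based ascending ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_auc(labels, scores):
    """AUC = U / (n_pos * n_neg), with U from the positives' rank sum."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    ranks = average_ranks(scores)
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = rank_sum - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


# Toy data (made up for illustration).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
base = rank_auc(labels, scores)
rescaled = rank_auc(labels, [math.exp(10 * s) for s in scores])  # same ranks
flipped = rank_auc(labels, [-s for s in scores])                 # reversed ranks
print(base, rescaled, flipped)
```

Rescaling the scores exponentially changes their magnitudes but not their order, so the value is identical; negating them reverses the order and yields one minus the original value.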
2009
- (Hand, 2009) ⇒ David J. Hand. (2009). “Mismatched Models, Wrong Results, and Dreadful Decisions: On Choosing Appropriate Data Mining Tools.” In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2009). doi:10.1145/1557019.1557021
- QUOTE: For predictive classification problems, a wide variety of score functions exist, including measures such as precision and recall, the F measure, misclassification rate, the area under the ROC curve (the AUC), and others. The first four of these require a 'classification threshold' to be chosen, a choice which may not be easy, or may even be impossible, especially when the classification rule is to be applied in the future. In contrast, the AUC does not require the specification of a classification threshold,
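The threshold dependence that the quote attributes to measures such as misclassification rate can be seen on a toy example: the same scores give different error rates depending on where the cut is placed. This is a small illustrative sketch; `misclassification_rate` is a hypothetical helper, the data are made up, and predicting positive when the score is greater than or equal to the threshold is an assumed convention.

```python
def misclassification_rate(labels, scores, threshold):
    """Fraction of examples whose thresholded prediction differs from the label."""
    # Assumption: predict positive (1) when score >= threshold.
    predictions = [1 if s >= threshold else 0 for s in scores]
    errors = sum(1 for y, p in zip(labels, predictions) if y != p)
    return errors / len(labels)


# Toy data (made up for illustration).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
for threshold in (0.3, 0.5, 0.85):
    print(threshold, misclassification_rate(labels, scores, threshold))
```

Here the error rate is 2/6 at a threshold of 0.3, 1/6 at 0.5, and 2/6 at 0.85, whereas a rank-based summary of the same scores involves no such choice.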
2008
- (Chakrabarti et al., 2008) ⇒ Soumen Chakrabarti, Rajiv Khanna, Uma Sawant, and Chiru Bhattacharyya. (2008). “Structured Learning for Non-smooth Ranking Losses.” In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2008). doi:10.1145/1401890.1401906
- QUOTE: Learning to rank from relevance judgment is an active research area. … Listwise structured learning has been applied recently to optimize important non-decomposable ranking criteria like AUC (area under ROC curve) and MAP (mean average precision). We propose new, almost-linear-time algorithms to optimize for two other criteria widely used to evaluate search systems: MRR (mean reciprocal rank) and NDCG (normalized discounted cumulative gain) in the max-margin structured learning framework.
2005
- (Fogarty et al., 2005) ⇒ James Fogarty, Ryan S. Baker, and Scott E. Hudson. (2005). “Case Studies in the use of ROC Curve Analysis for Sensor-based Estimates in Human Computer Interaction.” In: Proceedings of Graphics Interface 2005 (GI 2005).
2004
- (Caruana & Niculescu-Mizil, 2004) ⇒ Rich Caruana, and Alexandru Niculescu-Mizil. (2004). “Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria.” In: Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ISBN:1-58113-888-1 doi:10.1145/1014052.1014063
- QUOTE: … compare nine boolean classification performance metrics: Accuracy, Lift, F-Score, Area under the ROC Curve, Average Precision, Precision/Recall Break-Even Point, Squared Error, Cross Entropy, and Probability Calibration. Multidimensional scaling (MDS) shows that these metrics span a low dimensional manifold.
2002
- (Chawla et al., 2002) ⇒ Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. (2002). “SMOTE: Synthetic Minority over-sampling Technique.” In: Journal of Artificial Intelligence Research, 16(1).
- QUOTE: The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
1997
- (Bradley, 1997) ⇒ Andrew P. Bradley. (1997). “The Use of the Area under the ROC Curve in the Evaluation of Machine Learning Algorithms.” In: Pattern Recognition Journal, 30(7). doi:10.1016/S0031-3203(96)00142-2
- QUOTE: In this paper we investigate the use of the area under the receiver operating characteristic (ROC) curve (AUC) as a performance measure for machine learning algorithms.
- ↑ Fawcett, Tom (2006); An introduction to ROC analysis, Pattern Recognition Letters, 27, 861–874.
- ↑ Hanley, James A.; McNeil, Barbara J. (1982). “The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve.” Radiology. 143 (1): 29–36. doi:10.1148/radiology.143.1.7063747. PMID 7063747.
- ↑ Mason, Simon J.; Graham, Nicholas E. (2002). “Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: Statistical significance and interpretation.” (PDF). Quarterly Journal of the Royal Meteorological Society. 128: 2145–2166. doi:10.1256/003590002320603584.
- ↑ Hand, David J.; and Till, Robert J. (2001); A simple generalization of the area under the ROC curve for multiple class classification problems, Machine Learning, 45, 171–186.
- ↑ Flach, P.A.; Wu, S. (2005). “Repairing concavities in ROC curves.” (PDF). 19th International Joint Conference on Artificial Intelligence (IJCAI'05). pp. 702–707.
- ↑ Hanczar, Blaise; Hua, Jianping; Sima, Chao; Weinstein, John; Bittner, Michael; and Dougherty, Edward R. (2010); Small-sample precision of ROC-related estimates, Bioinformatics 26 (6): 822–830
- ↑ Lobo, Jorge M.; Jiménez-Valverde, Alberto; and Real, Raimundo (2008), AUC: a misleading measure of the performance of predictive distribution models, Global Ecology and Biogeography, 17: 145–151
- ↑ Hand, David J. (2009); Measuring classifier performance: A coherent alternative to the area under the ROC curve, Machine Learning, 77: 103–123
- ↑ Flach, P.A.; Hernandez-Orallo, J.; Ferri, C. (2011). “A coherent interpretation of AUC as a measure of aggregated classification performance.” (PDF). Proceedings of the 28th International Conference on Machine Learning (ICML-11). pp. 657–664.
- ↑ Hernandez-Orallo, J.; Flach, P.A.; Ferri, C. (2012). “A unified view of performance metrics: translating threshold choice into expected classification loss.” (PDF). Journal of Machine Learning Research. 13: 2813–2869.
- ↑ [1]