Richard Lawrence, Claudia Perlich, et al.
IBM Systems Journal
We present a statistical analysis of the AUC as an evaluation criterion for classification scoring models. First, we consider significance tests for the difference between AUC scores of two algorithms on the same test set. We derive exact moments under simplifying assumptions and use them to examine approximate practical methods from the literature. We then compare AUC to empirical misclassification error when the prediction goal is to minimize future error rate. We show that the AUC may be preferable to empirical error even in this case and discuss the tradeoff between approximation error and estimation error underlying this phenomenon.
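To make the abstract's first question concrete, here is a minimal sketch of one approximate practical method for testing the difference between two models' AUC scores on the same test set: a paired bootstrap. This is an illustration under assumed names and data, not the paper's exact procedure (the paper derives exact moments under simplifying assumptions and examines such approximate tests against them).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def paired_bootstrap_auc_test(y_true, scores_a, scores_b, n_boot=2000):
    """Two-sided p-value for H0: AUC(model A) == AUC(model B).
    Test examples are resampled with replacement, so both AUCs are
    always computed on the same bootstrap sample (paired comparison)."""
    y_true = np.asarray(y_true)
    scores_a = np.asarray(scores_a)
    scores_b = np.asarray(scores_b)
    observed = roc_auc_score(y_true, scores_a) - roc_auc_score(y_true, scores_b)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, size=n)
        y = y_true[idx]
        if y.min() == y.max():  # AUC is undefined without both classes
            continue
        diffs.append(roc_auc_score(y, scores_a[idx]) -
                     roc_auc_score(y, scores_b[idx]))
    diffs = np.asarray(diffs)
    # Center the bootstrap distribution at zero to approximate the null.
    p = np.mean(np.abs(diffs - diffs.mean()) >= abs(observed))
    return observed, p

# Usage on synthetic data: model A carries a slightly stronger signal.
y = rng.integers(0, 2, size=500)
a = y + rng.normal(scale=1.0, size=500)
b = y + rng.normal(scale=1.5, size=500)
delta, p_value = paired_bootstrap_auc_test(y, a, b)
print(f"AUC difference = {delta:.3f}, bootstrap p-value = {p_value:.3f}")
```

Pairing matters here: because both models are scored on the identical resampled examples, the variance of the AUC difference accounts for the correlation between the two models' scores, which an unpaired comparison would ignore.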
Robert Tibshirani, Michael Saunders, et al.
Journal of the Royal Statistical Society, Series B (Statistical Methodology)
Aurélie C. Lozano, Naoki Abe, et al.
KDD 2009
Claudia Perlich, Saharon Rosset, et al.
KDD 2007