Roughly Balanced Bagging for Imbalanced Data
Abstract
The class imbalance problem appears in many real-world applications of classification learning. We propose an ensemble algorithm, "Roughly Balanced (RB) Bagging", which uses a novel sampling technique to improve the original bagging algorithm for data sets with skewed class distributions. With this sampling method, the numbers of examples drawn from the largest and smallest classes differ within each subset, but they are effectively balanced when averaged over all of the subsets, which fits the bagging approach more appropriately. Individual models in RB Bagging tend to show larger diversity, a key property of ensemble models, than those in existing bagging-based methods for imbalanced data, which use exactly the same number of majority and minority examples in every training subset. In addition, the proposed method makes full use of all of the minority examples by under-sampling the majority class, which is done efficiently using negative binomial distributions. Numerical experiments on benchmark and real-world data sets demonstrate that RB Bagging outperforms the existing "balanced" methods and other common methods in terms of area under the ROC curve (AUC), a widely used metric for the class imbalance problem. © 2009 Wiley Periodicals, Inc.
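The sampling idea described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: the success probability of the negative binomial is taken to be 0.5, majority examples are drawn with replacement, every minority example is kept in every subset, and the names `negative_binomial` and `rb_bagging_subsets` are hypothetical.

```python
import random


def negative_binomial(n_successes, p, rng):
    """Count the failures observed before n_successes successes in Bernoulli(p) trials."""
    failures = 0
    successes = 0
    while successes < n_successes:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures


def rb_bagging_subsets(minority_idx, majority_idx, n_bags, seed=0):
    """Build n_bags training subsets as (minority indices, majority indices) pairs.

    Assumptions (not taken from the abstract): p = 0.5, majority examples are
    drawn with replacement, and all minority examples appear in every subset.
    """
    rng = random.Random(seed)
    n_min = len(minority_idx)
    subsets = []
    for _ in range(n_bags):
        # The majority count varies per subset but has expected value n_min,
        # so the classes are balanced only on average.
        n_maj = negative_binomial(n_min, 0.5, rng)
        majority_sample = [rng.choice(majority_idx) for _ in range(n_maj)]
        subsets.append((list(minority_idx), majority_sample))
    return subsets


if __name__ == "__main__":
    minority = list(range(10))        # 10 minority examples
    majority = list(range(10, 1010))  # 1000 majority examples
    for mino, maj in rb_bagging_subsets(minority, majority, n_bags=5):
        print(len(mino), len(maj))
```

Each subset would then be used to train one base classifier, with the ensemble averaging their predictions as in ordinary bagging. Because the majority count is random, the subsets differ in class ratio as well as in content, which is the source of the extra diversity the abstract refers to.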