Haoran Liao, Derek S. Wang, et al.
Nature Machine Intelligence
We consider the problem of generating balanced training samples from an unlabeled data set with an unknown class distribution. While random sampling works well when the data are balanced, it is highly ineffective for imbalanced data. Other approaches, such as active learning and cost-sensitive learning, are also suboptimal: they are classifier-dependent and require labeled samples and misclassification costs, respectively. We propose a new strategy for generating training samples that is independent of both the underlying class distribution of the data and the classifier that will be trained on the labeled data. Our methods are iterative and can be seen as variants of active learning, in which semi-supervised clustering at each iteration guides biased sampling from the clusters. We provide several strategies to estimate the underlying class distributions within the clusters and to increase the balance of the training samples. Experiments with both highly skewed and balanced data sets from the UCI repository and a private data set show that our algorithm produces far more balanced samples than random sampling or uncertainty sampling. Moreover, our sampling strategy is substantially more efficient than active learning methods. The experiments also confirm that, given more balanced training data, classifiers trained with our samples outperform those trained with random sampling or active learning.
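The iterative scheme described above can be illustrated with a minimal sketch: cluster the pool, estimate each cluster's minority-class fraction from the points labeled so far, and bias the next query batch toward minority-rich clusters. The clustering method (a plain k-means here), the weighting rule, and all function names are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # Toy k-means used as a stand-in for the clustering step.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def biased_sample(X, y_known, k=4, batch=10, seed=0):
    """One iteration of cluster-biased sampling (illustrative).

    y_known: class labels of already-labeled points; -1 marks unlabeled.
    Clusters the pool, upweights clusters whose labeled points look
    minority-rich, and draws the next batch with those weights.
    """
    rng = np.random.default_rng(seed)
    clusters = kmeans(X, k, seed=seed)
    labeled = y_known >= 0
    weights = np.ones(k)
    if labeled.any():
        # Estimated minority class among the labels seen so far.
        minority = np.argmin(np.bincount(y_known[labeled]))
        for j in range(k):
            in_j = labeled & (clusters == j)
            if in_j.any():
                weights[j] += (y_known[in_j] == minority).mean()
    probs = weights[clusters].astype(float)
    probs[labeled] = 0.0  # never re-query an already-labeled point
    probs /= probs.sum()
    return rng.choice(len(X), size=batch, replace=False, p=probs)
```

Repeating `biased_sample` after labeling each batch yields the iterative, active-learning-like loop the abstract outlines.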
Kellen Cheng, Anna Lisa Gentile, et al.
EMNLP 2024
Zahra Ashktorab, Djallel Bouneffouf, et al.
IJCAI 2025
Baihan Lin, Guillermo Cecchi, et al.
IJCAI 2023