Fast Uncertainty Sampling for labeling large email corpora
Abstract
One of the biggest challenges in building effective anti-spam solutions is designing systems to defend against the everevolving bag of tricks spammers use to defeat them. Because of this, spam filters that work well today may not work well tomorrow. The adversarial nature of the spam problem makes large, up-to-date, and diverse e-mail corpora critical for the development and evaluation of new anti-spam filtering technologies. Gathering large collections of messages can actually be quite easy, especially in the context of a large, corporate or ISP environment. The challenge is not necessarily in collecting enough mail, however, but in collecting a representative distribution of mail types as seen "in the wild" and in then accurately labeling the hundreds of thousands or millions of accumulated messages as spam or non-spam. In the field of machine learning Uncertainty Sampling is a well-known Active Learning algorithm which uses a collaborative model to minimize the human effort required to label large datasets. While conventional Uncertainty Sampling has been shown to be very effective, it is also computationally very expensive since the learner must reclassify all the unlabeled instances during each learning iteration. We propose a new algorithm, Approximate Uncertainty Sampling (AUS), which is nearly as efficacious as Uncertainty Sampling, but has substantially lower computational complexity. The reduced computational cost allows Approximate Uncertainty Sampling to be applied to labeling larger datasets and also makes it possible to update the learned model more frequently. Approximate Uncertainty Sampling encourages the building of larger, more topical, and more realistic example e-mail corpora for evaluating new anti-spam filters. While we focus on the binary labeling of large volumes of e-mail messages, as with Uncertainty Sampling, Approximate Uncertainty Sampling can be used with a wide range of underlying classification algorithms for a variety of categorization tasks.