Rare category detection on time-evolving graphs
Abstract
Rare category detection(RCD) is an important topicin data mining, focusing on identifying the initial examples fromrare classes in imbalanced data sets. This problem becomes more challenging when the data is presented as time-evolving graphs, as used in synthetic ID detection and insider threat detection. Most existing techniques for RCD are designed for static data sets, thus not suitable for time-evolving RCD applications. To address this challenge, in this paper, we first proposetwo incremental RCD algorithms, SIRD and BIRD. They arebuilt upon existing density-based techniques for RCD, andincrementally update the detection models, which provide 'timeflexible' RCD. Furthermore, based on BIRD, we propose amodified version named BIRD-LI to deal with the cases wherethe exact priors of the minority classes are not available. Wealso identify a critical task in RCD named query distribution. Itaims to allocate the limited budget into multiple time steps, suchthat the initial examples from the rare classes are detected asearly as possible with the minimum labeling cost. The proposedincremental RCD algorithms and various query distributionstrategies are evaluated empirically on both synthetic and real data.