Finding heavy distinct hitters in data streams
Abstract
A simple indicator for an anomaly in a network is a rapid increase in the total number of distinct network connections. While it is fairly easy to maintain an accurate estimate of the current total number of distinct connections using streaming algorithms that exhibit both a low space and computational complexity, identifying the network entities that are involved in the largest number of distinct connections efficiently is considerably harder. In this paper, we study the problem of finding all entities whose number of distinct (outgoing or incoming) network connections is at least a specific fraction of the total number of distinct connections. These entities are referred to as heavy distinct hitters. Since this problem is hard in general, we focus on randomized approximation techniques and propose a sampling-based and a sketch-based streaming algorithm. Both algorithms output a list of the potential heavy distinct hitters including the estimated counts of the corresponding number of distinct connections. We prove that, depending on the required level of accuracy of the output list, the space complexities of the presented algorithms are asymptotically optimal up to small logarithmic factors. Additionally, the algorithms are evaluated and compared using real network data in order to determine their usefulness in practice. © 2011 ACM.