Sharing-aware outlier analytics over high-volume data streams
Abstract
Real-time analytics of anomalous phenomena on streaming data typically relies on processing a large variety of continuous outlier detection requests, each configured with different parameter settings. The processing of such complex outlier analytics workloads is resource consuming due to the algorithmic complexity of the outlier mining process. In this work we propose a sharing-aware multiquery execution strategy for outlier detection on data streams called SOP. A key insight of SOP is to transform the problem of handling a multi-query outlier analytics workload into a single-query skyline computation problem. We prove that the output of the skyline computation process corresponds to the minimal information needed for determining the outlier status of any point in the stream. Based on this new formulation, we design a customized skyline algorithm called K-SKY that leverages the domination relationships among the streaming data points to minimize the number of data points that must be evaluated for supporting multi-query outlier detection. Based on this K-SKY algorithm, our SOP solution achieves minimal utilization of both computational and memory resources for the processing of these complex outlier analytics workload. Our experimental study demonstrates that SOP consistently outperforms the state-of-art solutions by three orders of magnitude in CPU time, while only consuming 5% of their memory footprint - a clear winwin. Furthermore, SOP is shown to scale to large workloads composed of thousands of parameterized queries.