A General Method for Estimating Correlated Aggregates Over a Data Stream
Abstract
On a stream (Formula Presented.) of two dimensional data items (x, y) where x is an item identifier and y is a numerical attribute, a correlated aggregate query (Formula Presented.) asks to first apply a selection predicate σ along the y dimension, followed by an aggregation AGG along the x dimension. For selection predicates of the form (Formula Presented.) or (Formula Presented.), where parameter c is provided at query time, we present new streaming algorithms and lower bounds for estimating correlated aggregates. Our main result is a general method that reduces the estimation of a correlated aggregate AGG to the streaming computation of AGG over an entire stream, for an aggregate that satisfies certain conditions. This results in the first sublinear space algorithms for the correlated estimation of a large family of statistics, including frequency moments. Our experimental validation shows that the memory requirements of these algorithms are significantly smaller than existing linear storage solutions, and that these achieve a fast per-record processing time. We also study the setting when items have weights. In the case when weights can be negative, we give a strong space lower bound which holds even if the algorithm is allowed up to a logarithmic number of passes over the data. We complement this with a small space algorithm which uses a logarithmic number of passes.