About cookies on this site Our websites require some cookies to function properly (required). In addition, other cookies may be used with your consent to analyze site usage, improve the user experience and for advertising. For more information, please review your options. By visiting our website, you agree to our processing of information as described in IBM’sprivacy statement. To provide a smooth navigation, your cookie preferences will be shared across the IBM web domains listed here.
Publication
SODA 2007
Conference paper
Efficient aggregation algorithms for probabilistic data
Abstract
We study the problem of computing aggregation operators on probabilistic data in an I/O efficient manner. Algorithms for aggregation operators such as SUM, COUNT, AVG, and MIN/MAX are crucial to applications on probabilistic databases. We give a generalization of the classical data stream model to handle probabilistic data, called probabilistic streams, in order to analyze the I/O-requirements of our algorithms. Whereas the algorithms for SUM and COUNT turn out to be simple, the problem is harder for both AVG and MIN/MAX. Although data stream algorithms typically use randomness, all of the algorithms we present are deterministic. For MIN and MAX, we obtain efficient one-pass data stream algorithms for estimating each of these quantities with relative accuracy (1 + ϵ), using constant update time per element and O(1/ϵ lg R) space, where each element has a value between 1 and R. For AVG, we present a new data stream algorithm for estimating its value to a relative accuracy (1 + ∈) in O(log n) passes over the data with O( 1/ϵ log2 n) space and update time O( 1/ϵ log n) per element. On the other hand, we prove a space lower bound of Ω(n) for any exact one-pass deterministic data stream algorithm. Complementing this result, we also present an O(n log2 n)-time exact deterministic algorithm which uses O(n) space (thus removing the data-streaming restriction), improving dramatically on the previous O(n3)-time algorithm. Our algorithms for AVG involve a novel technique based on generating functions and numerical integration, which may be of independent interest. Finally, we provide an experimental analysis and show that our algorithms, coupled with additional heuristics, have excellent performance over large data sets.