Publication
ICDT 2009
Conference paper

The average-case complexity of counting distinct elements

Abstract

We continue the study of approximating the number of distinct elements in a data stream of length n to within a (1±ε) factor. It is known that if the stream may consist of arbitrary data arriving in an arbitrary order, then any 1-pass algorithm requires Ω(1/ε²) bits of space to perform this task. To try to bypass this lower bound, the problem was recently studied in a model in which the stream may consist of arbitrary data, but it arrives at the algorithm in a random order. However, even in this model an Ω(1/ε²) lower bound was established, because the adversary can still choose the data arbitrarily. This leaves open the possibility that the problem is hard only under a pathological choice of data, which would be of little practical relevance. We study the average-case complexity of this problem under certain distributions: namely, the case when each successive stream item is drawn independently and uniformly at random from an unknown subset of d items, for an unknown value of d. This captures the notion of random, uncorrelated data. For a wide range of values of d and n, we design a 1-pass algorithm that bypasses the Ω(1/ε²) lower bound that holds in the adversarial and random-order models, thereby showing that this model admits more space-efficient algorithms. Moreover, the update time of our algorithm is optimal. Despite these positive results, for a certain range of values of d and n we show that estimating the number of distinct elements requires Ω(1/ε²) bits of space even in this model. Our lower bound subsumes previous bounds, showing that the problem is hard even for natural choices of data. Copyright 2009 ACM.
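To make the task concrete, here is a minimal sketch, not the paper's algorithm, of the classical k-minimum-values (KMV) estimator for the number of distinct elements: hash each item to a value in [0, 1), keep the k smallest distinct hash values, and estimate the distinct count from the k-th smallest. Taking k on the order of 1/ε² gives a (1±ε)-approximation with high probability, which is the space regime the Ω(1/ε²) bound above refers to. The function and parameter names are illustrative.

```python
import hashlib
import heapq

def estimate_distinct(stream, k=1024):
    """K-minimum-values (KMV) sketch: keep the k smallest distinct hash
    values seen and estimate the distinct count as (k - 1) / (k-th
    smallest hash). Illustrative classical estimator, not the
    algorithm from the paper."""
    heap = []    # max-heap via negation: holds the k smallest hashes
    seen = set()
    for item in stream:
        # Hash each item to a (pseudo-)uniform value in [0, 1).
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16) / 2.0**160
        if h in seen:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            seen.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            seen.discard(evicted)
            seen.add(h)
    if len(heap) < k:
        return len(heap)          # fewer than k distinct items: exact count
    return (k - 1) / (-heap[0])   # -heap[0] is the k-th smallest hash
```

The paper's input model can be simulated by feeding the estimator a stream of n items drawn uniformly from an unknown d-element set, e.g. estimate_distinct(random.choices(range(d), k=n)).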
