About cookies on this site Our websites require some cookies to function properly (required). In addition, other cookies may be used with your consent to analyze site usage, improve the user experience and for advertising. For more information, please review your options. By visiting our website, you agree to our processing of information as described in IBM’sprivacy statement. To provide a smooth navigation, your cookie preferences will be shared across the IBM web domains listed here.
Publication
SIGMOD/PODS 2014
Conference paper
Is min-wise hashing optimal for summarizing set intersection?
Abstract
Min-wise hashing is an important method for estimating the size of the intersection of sets, based on a succinct summary (a "min-hash") independently computed for each set. One application is estimation of the number of data points that satisfy the conjunction of m ≥ 2 simple predicates, where a min-hash is available for the set of points satisfying each predicate. This has applications in query optimization and for approximate computation of COUNT aggregates. In this paper we address the question: How many bits is it necessary to allocate to each summary in order to get an estimate with 1 ± ε relative error? The state-of-the-art technique for minimizing the encoding size, for any desired estimation error, is b-bit min-wise hashing due to Li and König (Communications of the ACM, 2011). We give new lower and upper bounds: • Using information complexity arguments, we show that b-bit min-wise hashing is space optimal for m = 2 predicates in the sense that the estimator's variance is within a constant factor of the smallest possible among all summaries with the given space usage. But for conjunctions of m > 2 predicates we show that the performance of b-bit min-wise hashing (and more generally any method based on "κ-permutation" min-hash) deteriorates as m grows. • We describe a new summary that nearly matches our lower bound for m > 2. It asymptotically outperform all κ-permutation schemes (by around a factor Ω(m/log m)), as well as methods based on subsampling (by a factor Ω (log nmax), where nmax is the maximum set size). Copyright 2014 ACM.