Faster upper bounding of intersection sizes

Daisuke Takuma; Hiroki Yanagisawa

doi:10.1145/2484028.2484065

SIGIR 2013

Conference paper

02 Sep 2013

Abstract

There is a long history of developing efficient algorithms for set intersection, which is a fundamental operation in information retrieval and databases. In this paper, we describe a new data structure, a Cardinality Filter, to quickly compute an upper bound on the size of a set intersection. An upper bound on the size can be used to accelerate many applications, such as top-k query processing in text mining. Given finite sets A and B, the expected computation time for the upper bound on the intersection size |A ∩ B| is O((|A| + |B|)/w), where w is the machine word length. This is much faster than the current best algorithm for the exact intersection, which runs in O((|A| + |B|)/√w + |A ∩ B|) expected time. Our performance studies show that our implementations of Cardinality Filters are from 2 to 10 times faster than existing set intersection algorithms, and the time for a top-k query in a text mining application can be reduced by half. Copyright © 2013 ACM.
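
The abstract does not spell out how a Cardinality Filter is built, so the following is only an illustrative sketch of one way word-parallel bit operations can yield an upper bound on |A ∩ B|; it is not the authors' construction. It assumes each set is summarized by a fixed-width bitmap of hashed elements plus its size. Every bit that is set for A but not for B witnesses at least one element of A that cannot be in B, so |A ∩ B| ≤ |A| minus the number of such bits (and symmetrically for B). The names `make_filter`, `upper_bound`, and `num_bits` are hypothetical.

```python
def make_filter(items, num_bits=256):
    """Summarize a set as (bitmap, size).

    Bit (hash(x) % num_bits) is set for every distinct element x.
    num_bits is an illustrative parameter, not a value from the paper.
    """
    distinct = set(items)
    bitmap = 0
    for x in distinct:
        bitmap |= 1 << (hash(x) % num_bits)
    return bitmap, len(distinct)


def upper_bound(filter_a, filter_b):
    """Return a value that is never smaller than |A ∩ B|.

    A bit set only in A's bitmap marks a hash bucket holding at least
    one element of A and no element of B, so each such bit removes at
    least one element of A from the possible intersection.
    """
    bits_a, size_a = filter_a
    bits_b, size_b = filter_b
    only_a = bin(bits_a & ~bits_b).count("1")  # buckets present in A only
    only_b = bin(bits_b & ~bits_a).count("1")  # buckets present in B only
    return min(size_a - only_a, size_b - only_b)
```

For example, `upper_bound(make_filter(a), make_filter(b))` can be compared against a top-k threshold before deciding whether the exact intersection is worth computing. The bitwise AND and the bit count touch each bitmap word once, which is the kind of word-level parallelism the stated O((|A| + |B|)/w) bound relies on; how tight the bound is depends on the bitmap width relative to the set sizes.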
