Indexing text data under space constraints

Bijit Hore; Hakan Hacigumus; Sharad Mehrotra; Bala Iyer

doi:10.1145/1031171.1031212

CIKM 2004

Conference paper

08 Nov 2004

Indexing text data under space constraints

View publication

Abstract

An important class of queries is the LIKE predicate in SQL. In the absence of an index, LIKE queries are subject to performance degradation. The notion of indexing on substrings (or q-grams) has been explored earlier without sufficient consideration of efficiency, q-grams are used to prune away rows that do not qualify for the query. The problem is to identify a finite number of grams subject to storage constraint that gives maximal pruning for a given query workload. Our contributions include: i) a formal problem definition, proof that the problem is NP-hard and adaptation of a previously studied approximate algorithm that produces results within a provable error bound, ii) performance evaluation of the application of the novel method to real data, and iii) parallelization of the algorithm, scaling considerations and a proposal to handle scaling issues. Copyright 2004 ACM.

Conference paper