Optimal distance bounds for fast search on compressed time-series query logs

Michail Vlachos; Suleyman S. Kozat; Philip S. Yu

doi:10.1145/1734200.1734203

ACM Transactions on the Web

Paper

29 Apr 2010

Optimal distance bounds for fast search on compressed time-series query logs

View publication

Abstract

Consider a database of time-series, where each datapoint in the series records the total number of users who asked for a specific query at an internet search engine. Storage and analysis of such logs can be very beneficial for a search company from multiple perspectives. First, from a data organization perspective, because query Weblogs capture important trends and statistics, they can help enhance and optimize the search experience (keyword recommendation, discovery of news events). Second, Weblog data can provide an important polling mechanism for the microeconomic aspects of a search engine, since they can facilitate and promote the advertising facet of the search engine (understand what users request and when they request it). Due to the sheer amount of time-series Weblogs, manipulation of the logs in a compressed form is an impeding necessity for fast data processing and compact storage requirements. Here, we explicate how to compute the lower and upper distance bounds on the time-series logs when working directly on their compressed form. Optimal distance estimation means tighter bounds, leading to better candidate selection/elimination and ultimately faster search performance. Our derivation of the optimal distance bounds is based on the careful analysis of the problem using optimization principles. The experimental evaluation suggests a clear performance advantage of the proposed method, compared to previous compression/search techniques. The presented method results in a 10 - 30% improvement on distance estimations, which in turn leads to 25 - 80% improvement on the search performance. © 2010 ACM.

Paper