Publication
ICDE 2006
Conference paper
Techniques for warehousing of sample data
Abstract
We consider the problem of maintaining a warehouse of sampled data that "shadows" a full-scale data warehouse, in order to support quick approximate analytics and metadata discovery. The full-scale warehouse comprises many "data sets," where a data set is a bag of values; the data sets can vary enormously in size. The values constituting a data set can arrive in batch or stream form. We provide and compare several new algorithms for independent and parallel uniform random sampling of data-set partitions, where the partitions are created by dividing the batch or splitting the stream. We also provide novel methods for merging samples to create a uniform sample from an arbitrary union of data-set partitions. Our sampling/merge methods are the first to simultaneously support statistical uniformity, a priori bounds on the sample footprint, and concise sample storage. As partitions are rolled in and out of the warehouse, the corresponding samples are rolled in and out of the sample warehouse. In this manner our sampling methods approximate the behavior of more sophisticated stream-sampling methods, while also supporting parallel processing. Experiments indicate that our methods are efficient and scalable, and provide guidance for their application. © 2006 IEEE.
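To make the sample-then-merge idea concrete, here is a minimal Python sketch. It is not the paper's algorithms: it assumes textbook reservoir sampling for each partition and a standard hypergeometric merge of two bounded-size samples. The names `reservoir_sample` and `merge_samples`, and the bound `k`, are hypothetical and chosen only for illustration.

```python
import random


def reservoir_sample(values, k, rng):
    """Textbook reservoir sampling (assumed here, not taken from the paper).

    Returns (sample, n): a uniform random sample of at most k items from the
    partition `values`, plus the partition size n, which the merge step needs.
    """
    sample = []
    n = 0
    for v in values:
        n += 1
        if len(sample) < k:
            sample.append(v)          # fill the reservoir first
        else:
            j = rng.randrange(n)      # uniform integer in [0, n)
            if j < k:
                sample[j] = v         # keep v with probability k / n
    return sample, n


def merge_samples(s1, n1, s2, n2, k, rng):
    """Merge uniform samples of two disjoint partitions (hypothetical helper).

    s1 and s2 are uniform samples of partitions of sizes n1 and n2. The result
    is a uniform sample of size k from the union, for k <= min(len(s1), len(s2)).
    """
    if k > min(len(s1), len(s2)):
        raise ValueError("k must not exceed the size of either input sample")
    # Decide how many of the k items come from partition 1 by simulating k
    # draws without replacement from the union (a hypergeometric variate).
    from_first = 0
    r1, r2 = n1, n2
    for _ in range(k):
        if rng.random() * (r1 + r2) < r1:
            from_first += 1
            r1 -= 1
        else:
            r2 -= 1
    # Subsample each input sample uniformly and combine the pieces.
    return rng.sample(s1, from_first) + rng.sample(s2, k - from_first)


rng = random.Random(0)
s1, n1 = reservoir_sample(range(0, 1000), 50, rng)      # partition 1
s2, n2 = reservoir_sample(range(1000, 1500), 50, rng)   # partition 2
merged = merge_samples(s1, n1, s2, n2, 50, rng)
```

In this sketch every sample stays within the fixed bound `k`, and the two partitions can be sampled independently (for example, in parallel) before merging. Merging needs only the samples and the partition sizes, not the original data.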