A bi-level Bernoulli scheme for database sampling
Abstract
Current database sampling methods give the user insufficient control when processing ISO-style sampling queries. To address this problem, we provide a bi-level Bernoulli sampling scheme that combines the row-level and page-level sampling methods currently used in most commercial systems. By adjusting the parameters of the method, the user can systematically trade off processing speed and statistical precision; the appropriate choice of parameter settings becomes a query optimization problem. We indicate the SQL extensions needed to support bi-level sampling and determine the optimal parameter settings for an important class of sampling queries with explicit time or accuracy constraints. As might be expected, row-level sampling is preferable when data values on each page are homogeneous, whereas page-level sampling should be used when data values on a page vary widely. Perhaps surprisingly, we show that in many cases the optimal sampling policy is of the "bang-bang" type: we identify a "page-heterogeneity index" (PHI) such that optimal sampling is as "row-like" as possible if the PHI is less than 1 and as "page-like" as possible otherwise. The PHI depends upon both the query and the data, and can be estimated by means of a pilot sample. Because pilot sampling can be nontrivial to implement in commercial database systems, we also give a heuristic method for setting the sampling parameters; the method avoids pilot sampling by using a small number of summary statistics that are maintained in the system catalog. Results from over 1100 experiments on 372 real and synthetic data sets show that the heuristic method performs optimally about half of the time, and yields sampling errors within a factor of 2.2 of optimal about 93% of the time. The heuristic method is stable over a wide range of sampling rates and performs best in the most critical cases, where the data is highly clustered or skewed.
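
To make the combination of page-level and row-level sampling concrete, the following is a minimal Python sketch of one way a bi-level Bernoulli sample could be drawn. It is an illustration under stated assumptions, not the implementation described in the paper: the table is modeled as an in-memory list of pages, and the names bilevel_bernoulli_sample, estimate_sum, p (page inclusion probability) and r (row inclusion probability) are hypothetical and introduced here only for the example. Under this model each row is included with overall probability p * r, so setting p = 1 reduces to pure row-level sampling and setting r = 1 reduces to pure page-level sampling.

```python
import random


def bilevel_bernoulli_sample(pages, p, r, seed=None):
    """Draw a bi-level Bernoulli sample from a table stored as pages.

    pages: list of pages, each page a list of row values (assumed layout).
    p:     probability of including each page (hypothetical parameter name).
    r:     probability of including each row on an included page.
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= r <= 1.0):
        raise ValueError("p and r must lie in [0, 1]")
    rng = random.Random(seed)
    sample = []
    for page in pages:
        # Page-level Bernoulli trial: skipped pages are never read.
        if rng.random() < p:
            for row in page:
                # Row-level Bernoulli trial on each row of an included page.
                if rng.random() < r:
                    sample.append(row)
    return sample


def estimate_sum(sample, p, r):
    """Scale the sample sum by 1 / (p * r), the inverse inclusion probability."""
    q = p * r  # overall per-row inclusion probability
    if q == 0:
        raise ValueError("p * r must be positive to form an estimate")
    return sum(sample) / q


if __name__ == "__main__":
    # Toy table: three pages of three numeric rows each.
    toy_pages = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    drawn = bilevel_bernoulli_sample(toy_pages, p=0.5, r=0.8, seed=42)
    print(drawn, estimate_sum(drawn, p=0.5, r=0.8))
```

In this sketch, lowering p reduces the number of pages that must be read (favoring speed), while raising p and lowering r spreads the same overall rate p * r across more pages. This is the speed versus precision trade-off that the parameter settings control.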