Publication
APPROX/RANDOM 2020
Conference paper
Testing data binnings
Abstract
Motivated by the question of data quantization and “binning,” we revisit the problem of identity testing of discrete probability distributions. Identity testing (a.k.a. one-sample testing), a fundamental and by now well-understood problem in distribution testing, asks, given a reference distribution (model) q and samples from an unknown distribution p, both over [n] = {1, 2, ..., n}, whether p equals q, or is significantly different from it. In this paper, we introduce the related question of identity up to binning, where the reference distribution q is over k ≪ n elements: the question is then whether there exists a suitable binning of the domain [n] into k intervals such that, once “binned,” p is equal to q. We provide nearly tight upper and lower bounds on the sample complexity of this new question, showing both a quantitative and qualitative difference with the vanilla identity testing one, and answering an open question of Canonne [6]. Finally, we discuss several extensions and related research directions.
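To make the notion of identity up to binning concrete, the following is a minimal sketch of the exact, non-statistical version of the question: it assumes p is fully known (no sampling) and that intervals are allowed to be empty, neither of which is stated in the abstract. It is not the paper's tester, and the function name `find_binning` is hypothetical. Under these assumptions, a valid binning exists exactly when every partial sum of q appears among the partial sums of p.

```python
from fractions import Fraction
from itertools import accumulate


def find_binning(p, q):
    """Return breakpoints 0 = b_0 <= b_1 <= ... <= b_k = n such that the
    mass of p on elements b_{j-1}+1 .. b_j equals q[j-1], or None if no
    such partition into k (possibly empty) intervals exists.

    Assumption: p and q are sequences of exact Fractions, so equality
    of probability masses can be checked without rounding error.
    """
    # prefix_p[i] is the mass p places on the first i elements.
    prefix_p = [Fraction(0)] + list(accumulate(p))

    # Smallest index at which each prefix mass is reached.
    first_index = {}
    for i, mass in enumerate(prefix_p):
        first_index.setdefault(mass, i)

    breakpoints = [0]
    running = Fraction(0)
    for mass in q[:-1]:
        running += mass
        if running not in first_index:
            return None  # this partial sum of q is not attainable by p
        breakpoints.append(first_index[running])

    # The last interval must absorb exactly the remaining mass.
    if running + q[-1] != prefix_p[-1]:
        return None
    breakpoints.append(len(p))
    return breakpoints


F = Fraction
p = [F(1, 8), F(1, 8), F(1, 4), F(1, 4), F(1, 4)]
print(find_binning(p, [F(1, 4), F(3, 4)]))  # a valid binning exists
print(find_binning(p, [F(1, 3), F(2, 3)]))  # no valid binning
```

The testing problem studied in the paper is harder than this check, since p is only accessible through samples and the goal is to distinguish the existence of such a binning from p being significantly different from every binned match.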