Convergence of Smoothed Empirical Measures with Applications to Entropy Estimation
This paper studies convergence of empirical measures smoothed by a Gaussian kernel. Specifically, consider approximating P* Nσ, for Nσ ≜ N (0,σ 2 I d), by hat P n∗ Nσ under different statistical distances, where hat P n is the empirical measure. We examine the convergence in terms of the Wasserstein distance, total variation (TV), Kullback-Leibler (KL) divergence, and χ 2-divergence. We show that the approximation error under the TV distance and 1-Wasserstein distance (W 1) converges at the rate e O(d) n-1/2 in remarkable contrast to a (typical) n-frac 1 d rate for unsmoothed W 1 (and d ≥ 3). Similarly, for the KL divergence, squared 2-Wasserstein distance (W 22), and χ 2-divergence, the convergence rate is e O(d)} n-1, but only if P achieves finite input-output χ 2 mutual information across the additive white Gaussian noise (AWGN) channel. If the latter condition is not met, the rate changes to ω (n-1) for the KL divergence and W 22, while the χ 2-divergence becomes infinite-a curious dichotomy. As an application we consider estimating the differential entropy h(S+Z), where S∼ P and Z∼ Nσ are independent d-dimensional random variables. The distribution P is unknown and belongs to some nonparametric class, but n independently and identically distributed (i.i.d) samples from it are available. Despite the regularizing effect of noise, we first show that any good estimator (within an additive gap) for this problem must have a sample complexity that is exponential in d. We then leverage the above empirical approximation results to show that the absolute-error risk of the plug-in estimator converges as e O(d)} n-1/2, thus attaining the parametric rate in n. This establishes the plug-in estimator as minimax rate-optimal for the considered problem, with sharp dependence of the convergence rate both in n and d. We provide numerical results comparing the performance of the plug-in estimator to that of general-purpose (unstructured) differential entropy estimators (based on kernel density estimation (KDE) or k nearest neighbors (kNN) techniques) applied to samples of S+Z. These results reveal a significant empirical superiority of the plug-in to state-of-the-art KDE and kNN methods. As a motivating utilization of the plug-in approach, we estimate information flows in deep neural networks and discuss Tishby's Information Bottleneck and the compression conjecture, among others.