Trimming the ℓ1 regularizer: Statistical analysis, optimization, and applications to deep learning
Abstract
We study high-dimensional estimators with the trimmed ℓ1 penalty, which leaves the h largest parameter entries penalty-free. While optimization techniques for this nonconvex penalty have been studied, its statistical properties have not yet been analyzed. We present the first statistical analyses for M-estimation, and characterize support recovery and ℓ2 error of the trimmed ℓ1 estimates as a function of the trimming parameter h. Our results show different regimes based on how h compares to the true support size. Our second contribution is a new algorithm for the trimmed regularization problem, which has the same theoretical convergence rate as difference of convex (DC) algorithms, but in practice is faster and finds lower objective values. Empirical evaluation of ℓ1 trimming for sparse linear regression and graphical model estimation indicates that trimmed ℓ1 can outperform vanilla ℓ1 and non-convex alternatives. Our last contribution is to show that the trimmed penalty is beneficial beyond M-estimation, and yields promising results for two deep learning tasks: input structure recovery and network sparsification.
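To make the penalty concrete, here is a minimal sketch of how its value could be computed for a parameter vector: sort the absolute entries and sum all but the h largest. The function name `trimmed_l1_penalty` and its arguments are illustrative assumptions, not taken from the paper, and the sketch covers only the penalty value, not the estimator or the optimization algorithm.

```python
import numpy as np

def trimmed_l1_penalty(theta, h):
    """Sum of the (p - h) smallest absolute entries of theta.

    The h largest-magnitude entries are left penalty-free.
    Hypothetical helper for illustration only.
    """
    abs_vals = np.sort(np.abs(np.asarray(theta, dtype=float)).ravel())
    p = abs_vals.size
    if not 0 <= h <= p:
        raise ValueError("h must satisfy 0 <= h <= number of parameters")
    # After an ascending sort, the first p - h values are the penalized ones.
    return float(abs_vals[: p - h].sum())

theta = [3.0, -0.5, 0.2, -4.0, 1.0]
full_l1 = trimmed_l1_penalty(theta, 0)   # h = 0 recovers the ordinary l1 norm
trimmed = trimmed_l1_penalty(theta, 2)   # entries 3.0 and -4.0 are not penalized
```

With h = 0 the value coincides with the ordinary ℓ1 norm, and with h equal to the number of parameters the penalty vanishes, so h interpolates between full ℓ1 regularization and no regularization.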