Efficacy of Pruning in Ultra-Low Precision DNNs
Quantization (reducing the precision of variables and operations) and pruning (removing neurons and connections) are two popular approaches for improving the efficiency of DNNs. These directions have been pursued largely separately. In this work, we investigate combining DNN quantization and pruning as each of them is pushed to its limits. Specifically, we explore the efficacy of pruning DNNs for inference in the ultra-low precision (sub-8-bit) regime. Pruning requires the use of sparse formats to store and access only the non-zero values. We provide analytical expressions for the storage required by these sparse formats. We demonstrate that as weight precision decreases, the overhead of indexing non-zero locations starts to dominate, greatly diminishing the benefits of pruning. We specifically analyze the compression ratios of two popular sparse formats, Compressed Sparse Column (CSC) and Sparsity Map (Smap), and demonstrate that they drop significantly, even falling below 1 (inflation) in some cases. To address this challenge, we make the key observation that the best-performing sparse format varies across precisions and sparsity levels. Based on this observation, we propose a hybrid compression scheme that dynamically chooses between sparse formats at both network-level and layer-level granularity. We further propose a new scheme, compressed Sparsity Map (cSmap), to enhance the performance of hybrid compression. The cSmap scheme improves on the Smap scheme by applying compression methods widely used in manufacturing test. Across 6 state-of-the-art Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), the proposed hybrid compression scheme improves the average compression ratio by 18.3% to 34.7% compared to previous approaches.
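
The indexing-overhead effect described above can be illustrated with a minimal sketch. The storage models below are simplifying assumptions for illustration only and are not the analytical expressions derived in this work: Smap is modeled as one bitmap bit per weight position plus the non-zero values, and CSC is modeled as each non-zero value paired with a fixed-width relative index (an assumed 4 bits), ignoring column pointers and any padding entries needed when a run of zeros exceeds the index range. The function names and the `index_bits` parameter are hypothetical.

```python
def _nnz(n_weights: int, sparsity: float) -> int:
    """Number of non-zero weights for a given fraction of zeros."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    return round(n_weights * (1.0 - sparsity))


def smap_bits(n_weights: int, sparsity: float, weight_bits: int) -> int:
    """Assumed Smap model: 1 bitmap bit per position + non-zero values."""
    return n_weights + _nnz(n_weights, sparsity) * weight_bits


def csc_bits(n_weights: int, sparsity: float, weight_bits: int,
             index_bits: int = 4) -> int:
    """Assumed CSC model: each non-zero stores a value and a relative index.

    Column pointers and zero-padding entries are ignored (simplification).
    """
    return _nnz(n_weights, sparsity) * (weight_bits + index_bits)


def compression_ratio(fmt: str, n_weights: int, sparsity: float,
                      weight_bits: int) -> float:
    """Dense storage divided by sparse storage; a value < 1 means inflation."""
    dense = n_weights * weight_bits
    sparse = {"smap": smap_bits, "csc": csc_bits}[fmt](
        n_weights, sparsity, weight_bits)
    return dense / sparse


def best_format(n_weights: int, sparsity: float, weight_bits: int) -> str:
    """Pick the format with the higher ratio, as a hybrid scheme might."""
    return max(("csc", "smap"),
               key=lambda f: compression_ratio(f, n_weights, sparsity,
                                               weight_bits))


if __name__ == "__main__":
    n = 1_000_000
    for bits in (8, 4, 2):
        for s in (0.4, 0.7, 0.95):
            print(bits, s,
                  round(compression_ratio("csc", n, s, bits), 3),
                  round(compression_ratio("smap", n, s, bits), 3),
                  best_format(n, s, bits))
```

Under these assumed models, the ratio for either format shrinks as `weight_bits` decreases at a fixed sparsity, and the preferred format changes with sparsity, which is the qualitative behavior that motivates choosing the format per network or per layer.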