Viterbi-based pruning for sparse matrix with fixed and high index compression ratio
Weight pruning has proven to be an effective method of reducing the model size and computation cost without sacrificing its model accuracy. Conventional sparse matrix formats, however, involve irregular index structures with large storage requirement and a sequential reconstruction process, resulting in inefficient use of highly parallel computing resources. Hence, pruning is usually restricted to inference with a batch size of one, for which an efficient parallel matrix-vector multiplication method exists. In this paper, a new class of sparse matrix representation is proposed utilizing the Viterbi algorithm that has a high, and more importantly, fixed index compression ratio regardless of the pruning rate. In this approach, numerous sparse matrix candidates are first generated by the Viterbi encoder, and the candidate that aims to minimize the model accuracy degradation is then selected by the Viterbi algorithm. The model pruning process based on the proposed Viterbi encoder and Viterbi algorithm is highly parallelizable, and can be implemented efficiently in hardware to achieve low-energy and a high-performance index decoding process. Compared with the existing magnitude-based pruning methods, the index data storage requirement can be further compressed by 85.2% in MNIST and 83.9% in AlexNet while achieving a similar pruning rate. Even compared with the relative index compression technique, our method can still reduce the index storage requirement by 52.7% in MNIST and 35.5% in AlexNet.