Comparing Fisher Information Regularization with Distillation for DNN Quantization
A large body of work addresses deep neural network (DNN) quantization and pruning to mitigate the high computational burden of deploying DNNs. We analyze two prominent classes of methods: the first uses regularization based on the Fisher Information Matrix (FIM) of the parameters, while the second uses a student-teacher paradigm referred to as Knowledge Distillation (KD). The Fisher criterion can be interpreted as regularizing the network by penalizing an approximation of the Kullback-Leibler divergence (KLD) between the outputs of the original model and those of the quantized model. The KD approach bypasses the need to estimate the FIM and directly minimizes the KLD between the two models. We place the two approaches in a unified setting and study their generalization characteristics through their loss landscapes. On the CIFAR-10 and CIFAR-100 datasets, we show that at higher temperatures distillation produces wider minima in the loss landscape and yields higher accuracy than the Fisher criterion.
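To make the KD objective mentioned above concrete, the following is a minimal NumPy sketch of the temperature-softened KLD loss between a full-precision teacher and a quantized student, in the style of Hinton et al.'s distillation loss. The function names, the choice of `temperature=4.0` as a default, and the `T^2` scaling convention are illustrative assumptions, not details taken from this paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher T flattens the distribution."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_kld(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) over temperature-softened outputs.

    The T^2 factor (a common convention) keeps gradient magnitudes
    comparable across temperatures; both arrays are (batch, classes).
    """
    p = softmax(teacher_logits, temperature)  # teacher (original model) outputs
    q = softmax(student_logits, temperature)  # student (quantized model) outputs
    kld = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return temperature ** 2 * kld.mean()
```

At high temperatures the softened distributions expose the teacher's relative preferences among incorrect classes, which is consistent with the abstract's observation that higher temperatures change the geometry of the resulting minima.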