Over the past few years, reduced-precision techniques have proven exceptionally effective in accelerating deep learning training and inference applications. IBM Research has played a leadership role in developing reduced-precision technologies and pioneered a number of key breakthroughs, including the first 16-bit reduced-precision systems for deep learning training (presented at ICML 2015), the first 8-bit training techniques (presented recently at NeurIPS 2018), and state-of-the-art 2-bit inference results (published at SysML 2019).
Following this line of work, we now introduce a new breakthrough which solves a long-ignored, yet important problem in reduced-precision deep learning: accumulation bit-width scaling for ultra-low-precision training of deep neural networks (DNNs).
The most commonly used arithmetic function in deep learning is the dot product, which is the building block of generalized matrix multiplication (GEMM) and convolution computations. A dot product is computed using multiply-accumulate (MAC) operations and thus requires two floating-point computations: multiplication of two numbers and accumulation of the product into a partial sum. Today, much of the effort on reduced-precision deep learning focuses solely on quantizing representations, i.e. the input operands to the multiplication operation. The other major portion of the dot product computation, i.e. the partial sum accumulation, has always been kept in full (32-bit) precision. The reason is that reduced-precision accumulation can result in severe training instability and degradation in model accuracy, as shown in Fig. 1a. This is especially unfortunate, since the area and power of the hardware are dominated by the accumulator bit-width once the other precisions are aggressively reduced. As shown in Fig. 1b, accumulating in high precision severely limits the hardware benefits of reduced-precision data representations and computation. Without a framework for analyzing the precision requirements of partial sum accumulation, designers inevitably make very conservative choices.
Figure 1. The importance of accumulation precision. (a) Convergence curves of an ImageNet ResNet18 experiment using reduced precision accumulation. The current practice is to keep the accumulation in full precision to avoid such divergence. (b) Estimated area benefits when reducing the precision of a floating-point unit (FPU). The terminology FPa/b denotes an FPU whose multiplier and adder use a and b bits, respectively. Our work enables convergence in reduced precision accumulation and gains an extra 1.5–2.2× area reduction.
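The split between multiplier precision and accumulator precision can be sketched in a few lines of Python. This is an illustration only: the function name mac_dot is hypothetical, and NumPy's float16 and float32 types stand in for the reduced-precision and full-precision formats, which are not necessarily the formats used in our experiments.

```python
import numpy as np

def mac_dot(x, w, acc_dtype=np.float32):
    """Dot product as a chain of multiply-accumulate (MAC) steps.

    The multiplication always uses float16 operands; the partial sum is
    kept in `acc_dtype`, so the accumulator precision can be varied
    independently of the operand precision.
    """
    acc = acc_dtype(0.0)
    for xi, wi in zip(x, w):
        product = np.float16(xi) * np.float16(wi)   # reduced-precision multiply
        acc = acc_dtype(acc + acc_dtype(product))   # accumulate into the partial sum
    return acc

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=4096)
w = rng.uniform(0.0, 1.0, size=4096)

print("float64 reference:              ", float(np.dot(x, w)))
print("fp16 multiply, fp32 accumulate: ", float(mac_dot(x, w, np.float32)))
print("fp16 multiply, fp16 accumulate: ", float(mac_dot(x, w, np.float16)))
```

Only the accumulator type changes between the last two calls, so any gap between them is attributable to the accumulation alone.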
In our paper published at ICLR 2019, we take a step forward and present a comprehensive statistical model to analyze the impact of reduced-precision accumulation in deep learning training. The analysis rests on two critical insights. First, when we accumulate a dot product in reduced precision, the loss of information is primarily due to so-called “swamping error”. In floating-point arithmetic, swamping occurs when a small number is added to a much larger one: the small number is completely or partially truncated out of the result. Second, swamping error harms the statistics (i.e. the variance) of a dot product. To ensure stable convergence of DNNs, it is necessary to preserve the variance of dot products under reduced precision.
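Swamping is easy to reproduce with a single addition. The sketch below uses NumPy's float16 (10 mantissa bits) purely as a convenient reduced-precision type; the specific numbers are chosen for illustration and do not come from the paper.

```python
import numpy as np

# Full swamping: between 2048 and 4096 the spacing of float16 values is 2,
# so adding 1 to 2048 cannot change the stored result.
big = np.float16(2048.0)
small = np.float16(1.0)
full_swamp = big + small

# Partial swamping: between 256 and 512 the spacing is 0.25, so only part
# of the small operand (0.25 out of 0.3125) survives the addition.
mid = np.float16(256.0)
frac = np.float16(0.3125)
partial_swamp = mid + frac

print("2048 + 1      ->", float(full_swamp))     # 2048.0
print("256  + 0.3125 ->", float(partial_swamp))  # 256.25
```

In a long accumulation the partial sum plays the role of the large operand, which is why the error grows with accumulation length.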
Using these insights, we derived a set of equations and introduced a new metric, the variance retention ratio (VRR) of a reduced-precision accumulation, in the context of the three deep learning GEMM functions. The VRR depends only on the accumulation length and the number of bits used for accumulation, so it can be computed without running any simulation (Fig. 2). The VRR can be used to assess the suitability, or lack thereof, of a precision configuration, and allows us to determine accumulation bit-widths for precise tailoring of deep learning computation hardware. From the VRR calculations in Fig. 2, it is easy to see that chunk-based accumulation for typical deep learning computations can preserve accuracy down to 9 bits of accumulation precision, macc (which corresponds to fp16 accumulation), while traditional non-chunked accumulation needs a much higher macc (up to 15 bits, corresponding to fp32 accumulation). This reduced accumulation bit-width requirement translates directly to a 1.5–2.2× improvement in hardware energy efficiency (as indicated in Fig. 1).
Figure 2. Normalized variance lost as a function of accumulation length for different values of accumulation bit-width, macc, for (a) a normal accumulation (no chunking) and (b) a chunk-based accumulation (chunk size of 64). The “knees” in each plot correspond to the maximum accumulation length for a given precision, which indicates how the VRR is to be used to select a suitable precision.
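The difference between normal and chunk-based accumulation can also be explored empirically. The sketch below is a Monte Carlo stand-in, not the closed-form VRR from the paper. It assumes that the accumulator keeps m mantissa bits, that excess bits are truncated toward zero, and that the exponent range is unbounded. The names truncate_mantissa, accumulate and chunked_accumulate are hypothetical.

```python
import numpy as np

def truncate_mantissa(x, m):
    """Keep m mantissa bits after the implicit leading 1, truncating toward zero."""
    mant, exp = np.frexp(x)            # x = mant * 2**exp with 0.5 <= |mant| < 1
    scale = 2.0 ** (m + 1)
    return np.ldexp(np.trunc(mant * scale) / scale, exp)

def accumulate(terms, m):
    """Sum each row of `terms` sequentially with an m-bit-mantissa accumulator."""
    acc = np.zeros(terms.shape[0])
    for k in range(terms.shape[1]):
        acc = truncate_mantissa(acc + terms[:, k], m)
    return acc

def chunked_accumulate(terms, m, chunk=64):
    """Sum within chunks first, then sum the per-chunk partial results."""
    n = terms.shape[1]
    partials = [accumulate(terms[:, s:s + chunk], m) for s in range(0, n, chunk)]
    return accumulate(np.stack(partials, axis=1), m)

# Empirical variance retained by each scheme for zero-mean random terms.
rng = np.random.default_rng(0)
terms = rng.standard_normal((500, 4096))    # 500 trials, accumulation length 4096
exact_var = terms.sum(axis=1).var()
for m in (4, 6, 8):
    plain = accumulate(terms, m).var() / exact_var
    chunked = chunked_accumulate(terms, m, chunk=64).var() / exact_var
    print(f"m={m}: retained variance  normal={plain:.3f}  chunked={chunked:.3f}")
```

Chunking shortens every individual accumulation: each chunk sums 64 terms, and the outer sum has only as many terms as there are chunks. The partial sums therefore stay closer in magnitude to the values being added to them.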
Using this analysis, we successfully predicted and experimentally verified the minimum accumulation precisions required by the three GEMM functions across three popular benchmark networks (CIFAR-10 ResNet34, ImageNet ResNet18, and ImageNet AlexNet). Our results show that the method accurately pinpoints the minimum precision needed for the benchmark networks to converge to the full-precision baseline. On the practical side, this analysis is a useful tool for hardware designers implementing reduced-precision processing hardware. We believe this work addresses a critical missing link on the path to ultra-low-precision hardware for DNN training.