Publication
SISC 2022
Invited talk

Hardware Algorithm Co-optimization for Scalable Analog Compute Technology

Abstract

Dense crossbar arrays of non-volatile memory (NVM) can potentially enable massively parallel and highly energy-efficient computing systems [1]. We have introduced an ensemble of Resistive Processing Unit (RPU) devices that can simultaneously store and process data locally and in parallel, thus providing significant acceleration for deep neural network (DNN) training [2]. Our analysis shows that conventional NVM elements do not meet the requirements for optimal RPU operation. In particular, we find that the optimal switching behavior is analog in nature and that symmetric resistance switching under pulse stimulation is required.

One approach to leveraging existing NVM technologies is to customize the training algorithm and make it more robust against device non-idealities. Fig. 1 shows how hardware and algorithm are co-optimized, using HfO2-based resistive random access memory (ReRAM) as an example [3]. We have recently proposed a modified stochastic gradient descent (SGD) algorithm, called "Tiki-Taka (TT)", that significantly relaxes the requirement on switching symmetry and improves tolerance to device programming noise. Compared to the conventional SGD algorithm, TTv1 [4] and TTv2 [5] use an array A to accumulate the gradients in-memory and a separate array C that stores the weights of a linear layer. C is updated relatively slowly (intermittently, in row-by-row fashion) with the recent past of the accumulated gradients read from A. It has been shown that this (spatial and temporal) separation helps to compensate for asymmetric and stochastic conductance responses without compromising the parallel in-memory compute. TTv2 additionally adds relatively sparse digital compute to filter the gradients read out from A before writing them to C, greatly attenuating noise fluctuations (Fig. 2); a conceptual sketch of this update flow is given below.

To estimate the scalability of this approach, we obtained model parameters from a ReRAM array with 2k devices and then performed DNN training simulations with an adapted version of the AIHWKIT simulator [6] (a simplified setup is also sketched below). Fig. 3 shows the expected test error of a 3-layer fully connected DNN on the MNIST data set [7]. By optimizing the ReRAM material in conjunction with TTv2, 97% accuracy could be achieved, although further optimization is needed to reach FP accuracy (98.2%).

The above approach for utilizing existing NVM technologies comes at the cost of additional peripheral circuits and memory elements. Therefore, to fully leverage the advantages of analog computing, a new device concept tailored to the requirements of deep learning applications is highly desired. One such example is Electrochemical Random-Access Memory (ECRAM) [8, 9]. ECRAM is a three-terminal device composed of a conductive channel, an insulating electrolyte, an ionic reservoir, and metal contacts. The resistance of the channel is modulated by ionic exchange at the interface between the channel and the electrolyte upon application of an electric field. This charge-transfer process allows both for state retention in the absence of applied power and for programming of multiple distinct levels. The write operation is deterministic and can result in symmetric potentiation and depression, making ECRAM arrays attractive for deep learning applications (Fig. 4). For an initial demonstration of deep learning training using analog NVM at scale, conventional CMOS and memory technologies should be fully leveraged.
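To make the TT update flow described above concrete, the following is a minimal NumPy sketch of the idea, not the authors' implementation: an array A accumulates rank-one gradient updates in parallel, while a second array C holding the layer weights is updated intermittently, row by row, from a digitally filtered read-out of A. All names and hyperparameters (H, beta, transfer_every, the partial reset of A) are illustrative assumptions, and device non-idealities are not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in = 16, 32

A = np.zeros((n_out, n_in))              # analog array: in-memory gradient accumulation
C = rng.normal(0.0, 0.1, (n_out, n_in))  # analog array: weights of the linear layer
H = np.zeros((n_out, n_in))              # digital filter state (TTv2's sparse digital compute)

lr, transfer_lr, beta = 0.1, 0.05, 0.9   # illustrative hyperparameters
transfer_every = 8                       # how often one row of C is updated
row = 0                                  # next row of C to update


def forward(x):
    """Forward pass of the linear layer; the weights live in C."""
    return C @ x


def update(x, delta, t):
    """One training step: x is the layer input, delta the back-propagated error."""
    global row
    # Parallel, rank-one outer-product update: gradients accumulate in-memory on A.
    A[:] += lr * np.outer(delta, x)
    # Intermittently read one row of A, filter it digitally, and write it to C.
    if t % transfer_every == 0:
        g = A[row]                                  # read-out of the accumulated gradient
        H[row] = beta * H[row] + (1.0 - beta) * g   # low-pass filtering attenuates noise
        C[row] += transfer_lr * H[row]              # slow, row-by-row update of the weights
        A[row] *= 0.5                               # partial reset of the transferred row (assumption)
        row = (row + 1) % n_out
```

On real RPU hardware, the outer-product update and the row read-out would be carried out in the analog arrays themselves, with only the filtering and bookkeeping performed digitally.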
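The training simulations mentioned above used an adapted version of the open-source AIHWKIT simulator [6]. As a rough illustration of what such a setup can look like, the sketch below builds a 3-layer fully connected analog network with aihwkit. The layer sizes, activations, and the SingleRPUConfig/SoftBoundsDevice device model are placeholders rather than the ReRAM-calibrated TTv2 configuration used in the actual study, and the import paths follow recent aihwkit releases.

```python
import torch
from torch import nn
from aihwkit.nn import AnalogLinear
from aihwkit.optim import AnalogSGD
from aihwkit.simulator.configs import SingleRPUConfig
from aihwkit.simulator.configs.devices import SoftBoundsDevice

# Placeholder analog device model; the study used a ReRAM-calibrated TTv2 configuration.
rpu_config = SingleRPUConfig(device=SoftBoundsDevice())

# 3-layer fully connected network on MNIST-shaped inputs (layer sizes are illustrative).
model = nn.Sequential(
    AnalogLinear(784, 256, rpu_config=rpu_config), nn.Sigmoid(),
    AnalogLinear(256, 128, rpu_config=rpu_config), nn.Sigmoid(),
    AnalogLinear(128, 10, rpu_config=rpu_config), nn.LogSoftmax(dim=1),
)

optimizer = AnalogSGD(model.parameters(), lr=0.05)
optimizer.regroup_param_groups(model)
criterion = nn.NLLLoss()

# One training step on a dummy MNIST-shaped batch.
x, y = torch.rand(8, 784), torch.randint(0, 10, (8,))
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```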
At this near- to mid-term stage of technology development, co-optimization of device and algorithm is key to mitigating device non-idealities, as we review in this talk. Most NVM devices available today are optimized for digital applications. There is plenty of room to improve analog switching quality at the expense of more traditional memory figures of merit that are not directly relevant to deep learning training. For the long term, continued effort to build a specialized device from the ground up is indispensable.

Acknowledgement

I would like to thank my collaborators at IBM Research and Tokyo Electron. This work is supported by the IBM Research AI Hardware Center (ibm.co/ai-hardware-center).

References

[1] G. W. Burr et al., IEEE Spectrum, vol. 58, no. 12, p. 44 (2021)
[2] T. Gokmen and Y. Vlasov, Front. Neurosci. 10 (2016)
[3] Y. Kim et al., IEEE EDL, 42, p. 759 (2021)
[4] T. Gokmen et al., Front. Neurosci. 14:103 (2020)
[5] T. Gokmen et al., Front. Neurosci. 4:126 (2021)
[6] M. J. Rasch et al., IEEE AICAS, p. 1 (2021)
[7] N. Gong, M. J. Rasch et al., IEDM, 33.7 (2022)
[8] J. Tang et al., IEDM, p. 292 (2018)
[9] S. Kim et al., IEDM, p. 847 (2019)

Date

08 Dec 2022

