Towards Exact Gradient-based Training on Analog In-memory Computing
Abstract
Analog in-memory accelerators present a promising solution for energy-efficient training and inference of large vision and language models. While inference on analog accelerators has been studied recently, training on such devices remains under-explored. Recent studies have shown that the vanilla analog stochastic gradient descent (Analog SGD) algorithm {\em converges inexactly} and thus performs poorly when applied to model training on non-ideal devices. To tackle this issue, various analog-friendly gradient-based algorithms have been proposed, such as Tiki-Taka and its variants. Although Tiki-Taka exhibits superior empirical performance compared to Analog SGD, it is a heuristic algorithm that lacks theoretical underpinnings. This paper puts forth a theoretical foundation for gradient-based training on analog devices. We begin by characterizing the non-convergence of Analog SGD, which is caused by an asymptotic error arising from asymmetric updates and gradient noise. We then provide a convergence analysis of Tiki-Taka, showing that it converges exactly to a critical point and hence eliminates the asymptotic error. Simulations verify the correctness of the analyses.
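To make the contrast concrete, below is a minimal scalar sketch (not the paper's exact formulation) of the two update rules on an idealized asymmetric linear device: Analog SGD applies noisy gradients directly to the analog weight, while a simplified Tiki-Taka variant accumulates them on a fast auxiliary array $P$ and periodically transfers $P$ into the weight $W$. The device model, the constants ($\alpha$, $\beta$, $\tau$, transfer period), and the quadratic objective are all illustrative assumptions.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Idealized asymmetric linear device (symmetric point at w = 0):
# an intended change dw lands as dw - |dw| * w / tau, so the realized
# step depends on the sign of dw and the current conductance w --
# the source of Analog SGD's asymptotic error.
def analog_update(w, dw, tau=1.0):
    return w + dw - np.abs(dw) * w / tau

def grad(w, w_star=0.7, noise=0.1):
    # Stochastic gradient of the toy loss 0.5 * (w - w_star)^2.
    return (w - w_star) + noise * rng.standard_normal()

alpha, beta, steps, period = 0.1, 0.5, 2000, 10

# --- Analog SGD: gradient applied directly to the analog weight ---
w_sgd = 0.0
for t in range(steps):
    w_sgd = analog_update(w_sgd, -alpha * grad(w_sgd))

# --- Simplified Tiki-Taka: gradients accumulate on a fast array P,
# --- which is periodically transferred into the weight W. Near a
# --- critical point the mean gradient vanishes, P decays to zero,
# --- and the transfer-induced bias vanishes with it.
w_tt, p = 0.0, 0.0
for t in range(steps):
    p = analog_update(p, -alpha * grad(w_tt))
    if (t + 1) % period == 0:
        w_tt = analog_update(w_tt, beta * p)  # transfer step

print(f"target 0.7 | Analog SGD: {w_sgd:.3f} | Tiki-Taka: {w_tt:.3f}")
\end{verbatim}
Under this toy model, Analog SGD settles at a biased point (the asymmetric term pulls the weight toward the device's symmetric point whenever gradient noise is present), whereas the Tiki-Taka iterate approaches the true minimizer, consistent with the convergence results summarized above.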