Nat. Comput. Sci.

Efficient Scaling of Large Language Models with Mixture of Experts and 3D Analog In-Memory Computing

Abstract

Large Language Models (LLMs), with their remarkable generative capabilities, have significantly impacted various fields, yet face challenges due to their immense parameter counts and the resulting high costs of training and inference. The trend toward ever-larger models exacerbates these challenges, particularly in terms of memory footprint, latency, and energy consumption. Traditional hardware such as GPUs, while powerful, is not optimally efficient for LLM inference, leading to a growing dependence on cloud services. In this paper, we explore the deployment of Mixture of Experts (MoE)-based models on 3D Non-Volatile Memory (NVM)-based Analog In-Memory Computing (AIMC) hardware. When combined with the MoE network architecture, this novel hardware paradigm, utilizing stacked NVM devices arranged in a crossbar array, offers an innovative solution to the parameter-fetching bottleneck typical of traditional models deployed on conventional von Neumann architectures. By simulating the deployment of both dense and MoE-based LLMs on an abstract 3D NVM-based AIMC system, we demonstrate that MoE-based models, due to their conditional compute paradigm, are better suited to this hardware, scaling more favorably and maintaining high performance even in the presence of noise typical of analog computations. Our findings suggest that MoE-based models, in conjunction with emerging 3D NVM-based AIMC, can significantly reduce the inference costs of state-of-the-art LLMs, making them more accessible and energy-efficient.
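
The conditional-compute argument can be made concrete with a small sketch. The snippet below is an illustrative toy example, not the simulator used in the paper: it contrasts a dense feed-forward layer, which must read all of its weights for every token, with a top-k MoE layer that only touches the selected experts' weights, and it applies multiplicative Gaussian noise to the weights as a crude stand-in for the noise typical of analog computations. All class and parameter names (`DenseFFN`, `MoEFFN`, `noise_std`, etc.) are assumptions made for this example.

```python
# Illustrative sketch (not the paper's simulator): dense vs. top-k MoE feed-forward
# layers, with multiplicative Gaussian weight noise as a crude first-order stand-in
# for analog in-memory computing weight error.
import numpy as np

rng = np.random.default_rng(0)

def noisy(w, noise_std=0.0):
    # Multiplicative Gaussian noise on the weights (noise_std = 0 leaves them exact).
    return w * (1.0 + rng.normal(0.0, noise_std, size=w.shape))

class DenseFFN:
    def __init__(self, d_model, d_ff):
        self.w1 = rng.normal(0, 0.02, (d_model, d_ff))
        self.w2 = rng.normal(0, 0.02, (d_ff, d_model))

    def forward(self, x, noise_std=0.0):
        # Every token activates (and must fetch) all 2 * d_model * d_ff parameters.
        h = np.maximum(x @ noisy(self.w1, noise_std), 0.0)
        return h @ noisy(self.w2, noise_std)

class MoEFFN:
    def __init__(self, d_model, d_ff, n_experts=8, top_k=2):
        self.top_k = top_k
        self.router = rng.normal(0, 0.02, (d_model, n_experts))
        self.experts = [DenseFFN(d_model, d_ff) for _ in range(n_experts)]

    def forward(self, x, noise_std=0.0):
        # Conditional compute: each token is routed to only top_k experts, so only
        # a fraction of the layer's parameters is read per token.
        logits = x @ self.router
        top = np.argsort(logits, axis=-1)[:, -self.top_k:]
        gates = np.take_along_axis(logits, top, axis=-1)
        gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
        gates /= gates.sum(axis=-1, keepdims=True)
        out = np.zeros_like(x)
        for t in range(x.shape[0]):          # loop over tokens
            for k in range(self.top_k):      # and their selected experts
                expert = self.experts[top[t, k]]
                out[t] += gates[t, k] * expert.forward(x[t:t + 1], noise_std)[0]
        return out

tokens = rng.normal(0, 1, (4, 64))           # 4 tokens, d_model = 64
dense = DenseFFN(64, 256)
moe = MoEFFN(64, 256, n_experts=8, top_k=2)
print("dense output:", dense.forward(tokens, noise_std=0.05).shape)
print("moe output:  ", moe.forward(tokens, noise_std=0.05).shape)
```

With top_k = 2 of 8 experts, each token reads roughly a quarter of the MoE layer's expert parameters per forward pass, which is the property that maps well onto 3D NVM-based AIMC, where weights stay stationary in the crossbar arrays rather than being fetched from off-chip memory.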