RaPiD: AI Accelerator for Ultra-Low Precision Training and Inference

Swagath Venkataramani; Vijayalakshmi Srinivasan; Wei Wang; Sanchari Sen; Jintao Zhang; Ankur Agrawal; Monodeep Kar; Shubham Jain; Alberto Mannari; Hoang Tran; Yulong Li; Eri Ogawa; Kazuaki Ishizaki; Hiroshi Inoue; Marcel Schaal; Mauricio Serrano; Jungwook Choi; Xiao Sun; Naigang Wang; Chia-Yu Chen; Allison Allain; James Bonano; Nianzheng Cao; Robert Casatuta; Matthew Cohen; Bruce Fleischer; Michael Guillorn; Howard Haynie; Jinwook Jung; Mingu Kang; Kyu-Hyoun Kim; Siyu Koswatta; Saekyu Lee; Martin Lutz; Silvia Müller; Jinwook Oh; Ashish Ranjan; Zhibin Ren; Scot Rider; Kerstin Schelm; Michael Scheuermann; Joel Silberman; Jie Yang; Vidhi Zalani; Xin Zhang; Ching Zhou; Matt Ziegler; Vinay Shah; Moriyoshi Ohara; Pong-Fei Lu; Brian Curran; Sunil Shukla; Leland Chang; Kailash Gopalakrishnan

doi:10.1109/ISCA52012.2021.00021

ISCA 2021

Conference paper

14 Jun 2021


Abstract

The growing prevalence and computational demands of Artificial Intelligence (AI) workloads have led to widespread use of hardware accelerators in their execution. Scaling the performance of AI accelerators across generations is pivotal to their success in commercial deployments. The intrinsic error-resilient nature of AI workloads presents a unique opportunity for performance/energy improvement through precision scaling. Motivated by recent algorithmic advances in precision scaling for inference and training, we designed RaPiD, a 4-core AI accelerator chip supporting a spectrum of precisions, namely 16- and 8-bit floating-point and 4- and 2-bit fixed-point. The 36 mm² RaPiD chip, fabricated in 7nm EUV technology, delivers a peak 3.5 TFLOPS/W in HFP8 mode and 16.5 TOPS/W in INT4 mode at nominal voltage. Using a performance model calibrated to within 1% of the measurement results, we evaluated DNN inference using 4-bit fixed-point representation on a single-chip, 4-core RaPiD system, and DNN training using 8-bit floating-point representation on a 768-TFLOPS AI system comprising four 32-core RaPiD chips. Our results show that INT4 inference at a batch size of 1 achieves 3 to 13.5 TOPS/W (average 7), and that FP8 training with a mini-batch of 512 achieves a sustained 102 to 588 TFLOPS (average 203) across a wide range of applications.
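To put the training figures in perspective, the sustained throughput can be compared against the 768-TFLOPS system peak. The short Python sketch below does only that arithmetic on the numbers quoted in the abstract. The variable names and the `utilization` helper are illustrative, and the per-core figure assumes that peak throughput is split evenly across all cores, which the abstract does not state.

```python
# Figures quoted in the abstract (FP8 training, mini-batch of 512).
PEAK_TFLOPS = 768.0      # peak of the four-chip training system
CHIPS = 4
CORES_PER_CHIP = 32
SUSTAINED_TFLOPS = {"min": 102.0, "average": 203.0, "max": 588.0}


def utilization(sustained, peak=PEAK_TFLOPS):
    """Return the fraction of peak throughput that is sustained."""
    return sustained / peak


# Assumption (not stated in the abstract): peak throughput is
# divided evenly across every core in the system.
peak_per_core = PEAK_TFLOPS / (CHIPS * CORES_PER_CHIP)

if __name__ == "__main__":
    print(f"peak per core: {peak_per_core:.1f} TFLOPS")
    for label, value in SUSTAINED_TFLOPS.items():
        share = utilization(value)
        print(f"{label}: {value:.0f} TFLOPS = {share:.1%} of peak")
```

Under these assumptions, the reported range corresponds to roughly 13% to 77% of peak, with the average near 26%.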

Related

Conference paper

Solving optimization tasks power-efficiently exploiting VO₂'s phase-change properties with Oscillating Neural Networks

Olivier Maher, N. Harnack, et al.

DRC 2023

Paper

A 7-nm Four-Core Mixed-Precision AI Chip with 26.2-TFLOPS Hybrid-FP8 Training, 104.9-TOPS INT4 Inference, and Workload-Aware Throttling

Sae Kyu Lee, Ankur Agrawal, et al.

IEEE JSSC

Paper

Filamentary TaOₓ/HfO₂ ReRAM Devices for Neural Networks Training with Analog In-Memory Computing

Tommaso Stecconi, Roberto Guido, et al.

Advanced Electronic Materials

Conference paper

A Multiscale Workflow for Thermal Analysis of 3DI Chip Stacks

Max Bloomfield, Amogh Wasti, et al.

ITherm 2025
