A 7nm 4-Core AI Chip with 25.6TFLOPS Hybrid FP8 Training, 102.4TOPS INT4 Inference and Workload-Aware Throttling

Ankur Agrawal; Saekyu Lee; Joel Silberman; Matt Ziegler; Mingu Kang; Swagath Venkataramani; Nianzheng Cao; Bruce Fleischer; Michael Guillorn; Matthew Cohen; Silvia Müller; Jinwook Oh; Martin Lutz; Jinwook Jung; Siyu Koswatta; Ching Zhou; Vidhi Zalani; James Bonanno; Robert Casatuta; Chia-Yu Chen; Jungwook Choi; Howard Haynie; Alyssa Herbert; Radhika Jain; Monodeep Kar; Kyu-Hyoun Kim; Yulong Li; Zhibin Ren; Scot Rider; Marcel Schaal; Kerstin Schelm; Michael Scheuermann; Xiao Sun; Hung Tran; Naigang Wang; Wei Wang; Xin Zhang; Vinay Shah; Brian Curran; Vijayalakshmi Srinivasan; Pong-Fei Lu; Sunil Shukla; Leland Chang; Kailash Gopalakrishnan

doi:10.1109/ISSCC42613.2021.9365791

ISSCC 2021

Conference paper

13 Feb 2021

Abstract

Low-precision computation is the key enabling factor to achieve high compute densities (TOPS/W and TOPS/mm²) in AI hardware accelerators across cloud and edge platforms. However, robust deep learning (DL) model accuracy equivalent to high-precision computation must be maintained. Improvements in bandwidth, architecture, and power management are also required to harness the benefit of reduced precision by feeding and supporting more parallel engines to achieve high sustained utilization and optimize performance within a given product power envelope. In this work, we present a 4-core AI chip in 7nm EUV technology that exploits cutting-edge algorithmic advances for iso-accurate models in low-precision training and inference [1, 2] and aggressive circuit/architecture optimization to achieve leading-edge power-performance. The chip supports fp16 (DLFloat16 [8]) and hybrid-fp8 (hfp8) [1] formats for training and inference of DL models, as well as int4 and int2 formats for highly scaled inference.
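As a rough illustration of what an int4 format implies for inference, the sketch below applies a generic symmetric per-tensor quantization to a handful of weights. It is not the scheme used by the chip or by the cited algorithms [1, 2]; the function names `quantize_int4` and `dequantize_int4`, the per-tensor scale, and the sample values are all assumptions made for this example.

```python
def quantize_int4(values):
    """Map floats to signed 4-bit codes in [-8, 7] with one shared scale.

    Generic symmetric per-tensor quantization, used here only as an
    illustration; it is not taken from the paper.
    """
    max_abs = max(abs(v) for v in values)
    # The largest magnitude maps to code 7; guard against all-zero input.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int4(codes, scale):
    """Recover approximate float values from 4-bit codes."""
    return [c * scale for c in codes]


# Hypothetical weights chosen for the example.
weights = [0.7, -0.3, 0.12, 0.0]
codes, scale = quantize_int4(weights)
approx = dequantize_int4(codes, scale)

for w, c, a in zip(weights, codes, approx):
    print(f"weight={w:+.3f}  code={c:+d}  dequantized={a:+.3f}")
```

Each weight is stored in 4 bits instead of 16, and the reconstruction error for values inside the representable range is at most half a quantization step. Keeping model accuracy at this precision is the part that depends on the algorithmic work the abstract cites.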

Conference paper

Solving optimization tasks power-efficiently exploiting VO₂'s phase-change properties with Oscillating Neural Networks

Olivier Maher, N. Harnack, et al.

DRC 2023

Paper

Filamentary TaO_x/HfO₂ ReRAM Devices for Neural Networks Training with Analog In-Memory Computing

Tommaso Stecconi, Roberto Guido, et al.

Advanced Electronic Materials

Paper

A 7-nm Four-Core Mixed-Precision AI Chip with 26.2-TFLOPS Hybrid-FP8 Training, 104.9-TOPS INT4 Inference, and Workload-Aware Throttling

Sae Kyu Lee, Ankur Agrawal, et al.

IEEE JSSC

Conference paper

A Multiscale Workflow for Thermal Analysis of 3DI Chip Stacks

Max Bloomfield, Amogh Wasti, et al.

ITherm 2025
