We remain the leaders in driving reduced precision for AI models [Figure 1], with industry-wide adoption. We’ve extended reduced-precision formats to 8-bit for training and 4-bit for inference, and developed data communication protocols that enable AI cores on a multi-core chip to exchange data effectively with each other. Most recently, our team demonstrated 4-bit formats for training at NeurIPS 2020. Read more about this in our piece, 'IBM breakthroughs could help bring AI training from cloud to edge.'
Our new ISSCC paper reflects the latest stage in these advancements. It focuses on the creation of a chip that is highly optimized for low-precision training and inference across all of the different AI model types, without any loss of quality at the application level.
We showcase several novel characteristics of the chip. To start with, it’s the first silicon chip ever to incorporate ultra-low-precision hybrid FP8 (HFP8) formats for training deep-learning models in a state-of-the-art silicon technology node (7 nm, EUV-based). Also, its raw power efficiency numbers are state of the art across all the different precisions. The table in Figure 3 shows that our chip’s performance and power efficiency exceed those of other dedicated inference and training chips.
But this is not all. It’s also one of the first AI hardware accelerator chips to incorporate power management. In this research, we show that we can maximize the performance of the chip within its total power budget by slowing it down during computation phases with high power consumption.
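To make the throttling idea concrete, here is a minimal Python sketch. It is not the chip's actual power-management logic: the `Phase` type, the example numbers, and the assumption that power scales linearly with clock frequency are all hypothetical simplifications. It compares slowing down only the power-hungry phases against running the whole workload at a single clock chosen for the worst-case phase.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    """One computation phase of a workload (hypothetical model)."""
    name: str
    work: float           # time units needed at the nominal clock
    power_nominal: float  # watts drawn at the nominal clock


def throttle_factor(phase: Phase, budget: float) -> float:
    """Clock scale in (0, 1] that keeps this phase within the budget.

    Assumption: power scales linearly with clock frequency.
    """
    return min(1.0, budget / phase.power_nominal)


def runtime_throttled(phases: list[Phase], budget: float) -> float:
    """Total runtime when each phase is slowed only as much as it needs."""
    return sum(p.work / throttle_factor(p, budget) for p in phases)


def runtime_static(phases: list[Phase], budget: float) -> float:
    """Total runtime when one clock is fixed for the worst-case phase."""
    worst = max(p.power_nominal for p in phases)
    scale = min(1.0, budget / worst)
    return sum(p.work / scale for p in phases)


if __name__ == "__main__":
    # Made-up workload: one power-hungry phase, one light phase.
    workload = [
        Phase("dense-compute", work=100.0, power_nominal=20.0),
        Phase("memory-bound", work=100.0, power_nominal=5.0),
    ]
    budget_watts = 10.0
    print("per-phase throttling:", runtime_throttled(workload, budget_watts))
    print("static worst-case clock:", runtime_static(workload, budget_watts))
```

With these made-up numbers, per-phase throttling finishes in 300 time units versus 400 for the static clock, because only the power-hungry phase is slowed down.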
Finally, we demonstrate that our chip, in addition to great peak performance, has high sustained utilization, which translates to real application performance and is a key part of engineering our chip for energy efficiency. Our chips routinely achieve more than 80 percent utilization for training and more than 60 percent utilization for inference, compared with typical GPU utilizations that are well below 30 percent.
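As a rough illustration of why utilization matters, the sketch below multiplies peak throughput by utilization to estimate sustained throughput. The peak figures (25.6 TFLOPS for HFP8 training, 102.4 TOPS for INT4 inference) come from the title of the ISSCC paper cited at the end of this post. Pairing them with the 80 and 60 percent utilization figures is an assumption made here for illustration, and since those figures are quoted as "more than", the results are best read as lower bounds. The `sustained` helper is a hypothetical name.

```python
def sustained(peak: float, utilization: float) -> float:
    """Estimate sustained throughput as peak throughput times utilization."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return peak * utilization


# Peak figures taken from the cited paper's title.
PEAK_HFP8_TRAINING_TFLOPS = 25.6
PEAK_INT4_INFERENCE_TOPS = 102.4

if __name__ == "__main__":
    train = sustained(PEAK_HFP8_TRAINING_TFLOPS, 0.80)
    infer = sustained(PEAK_INT4_INFERENCE_TOPS, 0.60)
    # Hypothetical comparison: the same training peak at 30 percent utilization.
    low_util = sustained(PEAK_HFP8_TRAINING_TFLOPS, 0.30)

    print(f"training at 80% utilization:  {train:.2f} TFLOPS")
    print(f"inference at 60% utilization: {infer:.2f} TOPS")
    print(f"same training peak at 30%:    {low_util:.2f} TFLOPS")
```

The point of the comparison is that, for a fixed peak, moving from 30 percent to 80 percent utilization raises delivered throughput by the same ratio, without any change to the silicon's headline number.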
Our new AI core and chip can be used for many new cloud-to-edge applications across multiple industries. For instance, they can be used for cloud training of large-scale deep learning models in vision, speech and natural language processing using 8-bit formats (vs. the 16- and 32-bit formats currently used in the industry). They can also be used for cloud inference applications, such as speech-to-text AI services, text-to-speech AI services, NLP services, financial transaction fraud detection and broader deployment of AI models in financial services.
Autonomous vehicles, security cameras and mobile phones can benefit too, and the chip can be handy for federated learning at the edge for customization, privacy, security and compliance.
We hope that through this work, we can establish an entirely new way of creating and deploying AI models that scale performance and cut power consumption. Please check out the IBM Research AI Hardware Center for more information about our research and our team.
Agrawal, A. et al. 9.1 A 7nm 4-Core AI Chip with 25.6TFLOPS Hybrid FP8 Training, 102.4TOPS INT4 Inference and Workload-Aware Throttling. in 2021 IEEE International Solid-State Circuits Conference (ISSCC) vol. 64 144–146 (2021).