AI Accelerator on IBM Telum Processor Industrial Product*

Cedric Lichtenau; Alper Buyuktosunoglu; Ramon Bertran; Peter Figuli; Christian Jacobi; Nikolaos Papandreou; Haris Pozidis; Anthony Saporito; Andrew Sica; Elpida Tzortzatos

doi:10.1145/3470496.3533042

ISCA 2022

Conference paper

17 Jun 2022

AI Accelerator on IBM Telum Processor Industrial Product*

View publication

Abstract

IBM Telum is the next generation processor chip for IBM Z and LinuxONE systems. The Telum design is focused on enterprise class workloads and it achieves over 40% per socket performance growth compared to IBM z15. The IBM Telum is the frst server-class chip with a dedicated on-chip AI accelerator that enables clients to gain real time insights from their data as it is getting processed. Seamlessly infusing AI in all enterprise workloads is highly desirable to get real business insight on every transaction as well as to improve IT operation, security, and data privacy. While it would undeniably provide signifcant additional value, its application in practice is often accompanied by hurdles from low throughput if run on-platform to security concerns and inconsistent latency if run off-platform. The IBM Telum chip introduces an on-chip AI accelerator that provides consistent low latency and high throughput (over 200 TFLOPS in 32 chip system) inference capacity usable by all threads. The accelerator is memory coherent and directly connected to the fabric like any other general-purpose core to support low latency inference while meeting the system's transaction rate. A scalable architecture providing transparent access to AI accelerator functions via a non-privileged general-purpose core instruction further reduces software orchestration and library complexity as well as provides extensibility to the AI functions. On a global bank customer credit card fraud detection model, the AI accelerator achieves 22× speed up in latency compared to a general purpose core utilizing vector execution units. For the same model, the AI accelerator achieves 116k inferences every second with a latency of only 1.1 msec. As the system is scaled up from one chip to 32 chips, it performs more than 3.5 Million inferences/sec and the latency still stays very low at only 1.2 msec. This paper briefly introduces the IBM Telum chip and later describes the integrated AI accelerator. IBM Telum's AI accelerator architecture, microarchitecture, integration into the system stack, performance, and power are covered in detail.

Conference paper