Charting a path to a more sustainable AI future
IBM Research’s AIU family of prototype chip designs point the way to a future where AI computation is more efficient, less power hungry, and more capable.
The way the world computes is shifting. We’ve long relied on computers for many aspects of our lives and work, but over the last decade, the world has seen an explosion in demand for AI applications.
AI model sizes are growing exponentially, but the hardware to train and run these behemoth models in the cloud (or on edge devices like smartphones and sensors) hasn’t advanced nearly as quickly. AI software demands enormous amounts of power and memory, and yesterday’s chip designs aren’t well suited to the needs of modern AI systems. Even today’s most powerful chips are built either to handle lots of simple tasks at once, or to excel at one type of task, like rendering graphics. To lower latency and increase chip speeds, memory and processing units need to sit closer together than ever before.
It’s why for years now, IBM Research has been exploring what the future of AI devices looks like. In 2019, it founded the AI Hardware Center, a global research hub headquartered in Albany, New York, where IBM works with academia and industry partners across the globe to uncover new ways to make running AI models easier and more energy efficient.
IBM researchers have been focused on developing new AI hardware architectures that could meet the performance requirements of today’s AI models. In 2022, IBM Research unveiled the prototype AIU, or artificial intelligence unit. Separate from a GPU or CPU, AIUs are built from the ground up specifically to handle AI workloads. The original prototype AIU chip isn’t the only workstream focused on making AI models more efficient and cost effective to run: IBM researchers are exploring an entire family of ideas.
Today, at the 2024 AI Hardware Forum at Yorktown Heights, IBM Semiconductors General Manager and Vice President of Hybrid Cloud Research Mukesh Khare described several research workstreams that are part of the AIU family of accelerators. These family members are in different stages of maturity, and represent the various ways that IBM is thinking about the future of chip design for AI — not just for tomorrow, but for years to come.
“We’re IBM Research,” Khare said. “We need to look at things that can be turned into a product in the near term, as well as things that could be revolutionary a decade from now.”
IBM Spyre is the most mature of the AIU family members. In August, IBM announced the IBM Spyre AI accelerator, a commercial system-on-a-chip that will allow future IBM Z systems to perform AI inferencing at an even greater scale than they can today. It will soon be available for businesses to incorporate into their AI workflows.
Spyre was inspired by the original AIU prototype, a device where both CPU and AI cores were tightly integrated on the same chip. First shown off in 2022, the prototype was built on a 5 nm node platform and contained 32 cores with 23 billion transistors. It was designed to send data directly from one compute engine to the next, creating enormous energy savings.
The IBM Research and Infrastructure teams have since worked together to evolve the AIU prototype into an enterprise-grade product so it could be incorporated into the next generation of Z mainframes. The result of that work is the IBM Spyre AIU accelerator. The production-ready accelerator shares a very similar architecture with that first prototype. Spyre has 32 individual accelerator cores onboard and contains 25.6 billion transistors connected by 14 miles of wire. It was also produced using 5 nm node process technology, and each Spyre is mounted on a PCIe card. Cards can be clustered together: a cluster of 8 cards, for example, adds 256 additional accelerator cores to a single IBM Z system.
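To make that cluster math concrete, here is a trivial sketch using only the figures quoted above (32 accelerator cores per Spyre card, clustered over PCIe). It is an illustration of the arithmetic, not IBM tooling.

```python
# Illustrative arithmetic only, based on the figures above: each Spyre card
# carries 32 accelerator cores, and cards can be clustered in a single system.
CORES_PER_SPYRE_CARD = 32

def added_cores(num_cards: int) -> int:
    """Accelerator cores a cluster of Spyre PCIe cards adds to a host system."""
    return num_cards * CORES_PER_SPYRE_CARD

print(added_cores(8))  # 256, matching the 8-card cluster example above
```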
Even with Spyre in production, the teams are working on what’s next. The Infrastructure team is working on its production models, and recently announced that the Spyre Accelerator will be integrated into the forthcoming Power11 system launching in 2025. IBM Research, meanwhile, is continuing to test and refine the IBM Spyre AIU for experimental purposes.
Both the original AIU prototype and IBM Spyre are application-specific integrated circuits (ASICs). They’re designed for deep learning and can be programmed to run any type of deep-learning task, from processing spoken language to powering generative AI. And because each chip is mounted on a PCIe card, it can be plugged into any computer or server with a PCIe slot.
Late last year, IBM Research installed a cluster of around 100 networked IBM Spyre AIU cards into a new system in its Yorktown Heights headquarters. The rack design showcases how AIUs could fit into the data centers of the future, and features chips working on real production AI workloads. They’re currently running HAP (hate, abuse, and profanity) filtering for models appearing on IBM’s data and AI platform, watsonx. The workload runs on roughly eight times less power than a cluster of GPUs optimized for training, with comparable throughput.
IBM also installed a rack of IBM Spyre AIU cards at the University at Albany earlier this year, which lets students and faculty run complex AI models on campus to help them explore and advance the state of the art in generative AI. And today at the AI Hardware Forum, IBM announced that a set of IBM Spyre AIU accelerators will be installed at the University of Alabama in Huntsville, just down the road from NASA’s Marshall Space Flight Center. Researchers there will use the cards for several generative AI experiments, including running the weather and climate models developed jointly by the university, IBM, and NASA. In early tests, the cluster can run the IBM-NASA models with three times more efficiency than GPUs, a result the researchers will continue to test and refine.
On the other side of the country in IBM Research’s lab in Almaden, California, another group of researchers has been working for the better part of two decades on a completely different approach to computing AI — taking inspiration from the structures of the brain in silicon. Our brains are the most powerful, energy-efficient processors we know, but creating any structure that approximates neurons digitally has been a massive challenge.
A team led by IBM Fellow Dharmendra Modha has spent years building chips that mimic the brain structures of various animals, from a worm to a bee, to a rat, and a cat. The goal has been to take the “left brain” calculations that traditional processors excel at and combine them with synaptic designs closer to how the right side of our brains reasons and processes imagery. This work resulted in the TrueNorth chip, which IBM Research first unveiled in 2014.
Much like the team on the east coast, Modha’s team in the west has been trying to break through what’s called the von Neumann bottleneck to decrease latency. Since the dawn of the semiconductor industry, computer chips have tended to separate processing and memory into simple, discrete devices that have scaled well over the decades. But with AI inferencing, where so much information needs to be processed in such a short period of time, precious time and energy are wasted shuttling information between processors and memory devices. To break that bottleneck, Modha’s team worked to store memory on the same device that does the processing. It’s a design called “near-memory computing,” inspired by how the brain often stores and processes information in nearby locations.
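As a rough way to see why that data movement matters, here is a toy sketch. The energy-per-byte values and the model footprint are hypothetical placeholders chosen only to show the shape of the trade-off; they are not measurements of NorthPole or of any GPU.

```python
# Toy model of the von Neumann bottleneck vs. near-memory computing.
# All constants below are hypothetical placeholders for illustration only.
OFFCHIP_ENERGY_PER_BYTE = 100.0  # hypothetical cost of fetching from external memory
ONCHIP_ENERGY_PER_BYTE = 1.0     # hypothetical cost of reading memory next to compute

def data_movement_energy(bytes_moved: int, near_memory: bool) -> float:
    """Energy spent just moving weights and activations, ignoring the math itself."""
    per_byte = ONCHIP_ENERGY_PER_BYTE if near_memory else OFFCHIP_ENERGY_PER_BYTE
    return bytes_moved * per_byte

weights_bytes = 25_000_000  # hypothetical model footprint
print(data_movement_energy(weights_bytes, near_memory=False))  # 2.5e9 energy units
print(data_movement_energy(weights_bytes, near_memory=True))   # 2.5e7 energy units
```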
For the last decade, the team has been working to turn this idea into a reality, finally unveiling AIU NorthPole, its newest chip prototype, in 2023. Under certain conditions, it can carry out AI inferencing considerably faster, and with far less power, than most other AI chips on the market today.
AIU NorthPole was fabricated with a 12 nm transistor node process and contains 22 billion transistors in a 795 square millimeter area. It has 256 cores and can perform 2,048 operations per core per cycle at 8-bit precision. AIU NorthPole tightly couples memory with the chip’s compute and control units, yielding a massive 13 terabytes per second of on-chip memory bandwidth. But NorthPole’s biggest advantage is also a constraint: it can only easily pull from the memory it has onboard, and all of the chip’s speedups would be undercut if it had to fetch information from somewhere else.
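Those figures allow a quick back-of-the-envelope calculation. The sketch below multiplies the core count by the per-core operation rate; the clock frequency is an assumed placeholder, since it isn’t stated here.

```python
# Peak-throughput arithmetic from the figures quoted above.
CORES = 256
OPS_PER_CORE_PER_CYCLE = 2048   # at 8-bit precision
ASSUMED_CLOCK_HZ = 400e6        # placeholder clock frequency, not a published spec

ops_per_cycle = CORES * OPS_PER_CORE_PER_CYCLE
print(ops_per_cycle)                            # 524,288 operations per cycle
print(ops_per_cycle * ASSUMED_CLOCK_HZ / 1e12)  # ~210 tera-ops/s at the assumed clock
```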
The team, however, is now looking at how NorthPole can support the growing trend of running smaller LLMs for specific use cases. They recently showed that running these sorts of models on an AIU NorthPole can be remarkably effective. In inference tests on a 3-billion-parameter LLM derived from IBM’s Granite-8B-Code-Base model, the team was able to get latency below 1 millisecond per token, nearly 47 times faster than the next most energy-efficient GPU.
They put 16 AIU NorthPoles, communicating via PCIe, in an off-the-shelf 2U server, and found that they could attain a throughput of 28,356 tokens per second on that same model. It reached these speeds while still being over 70 times more energy efficient than the next lowest latency GPU. To run the model, the team mapped its 14 transformer layers onto one card each, and mapped the output layer onto the remaining two cards. In those experiments, NorthPole was 25 times more energy efficient than popular 12 nm GPUs and 14 nm CPUs, as measured by the number of frames interpreted per unit of power.
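The card assignment the team describes is simple enough to write out explicitly. The sketch below is only an illustration of that mapping; the layer names are invented, and this is not IBM’s runtime code.

```python
# 16 NorthPole cards: 14 transformer layers on one card each,
# and the output layer split across the remaining two cards.
NUM_CARDS = 16
transformer_layers = [f"transformer_layer_{i}" for i in range(14)]

mapping = {card: layer for card, layer in enumerate(transformer_layers)}
mapping[14] = "output_layer_shard_0"
mapping[15] = "output_layer_shard_1"

assert len(mapping) == NUM_CARDS
for card, layer in sorted(mapping.items()):
    print(f"card {card:2d} -> {layer}")
```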
While AIU NorthPole is still a research prototype that’s being continuously refined, it is already showing very promising results, and is available for clients to experiment with and evaluate in their own infrastructure.
Perhaps the most experimental project in the AIU family lineup revolves around the concept of analog AI.
While the other members of the family are in many ways considerably more efficient than other AI processors available today, they pale in comparison to the computing efficiency of our brains, which use just a fraction of a kilowatt-hour each day. Teams of researchers at IBM are working on chips that are analog, meaning they require next to no power to run. Figuring out how to crack the secret of the brain’s efficiency in computer chips could be as disruptive for the future of computing as quantum computing is poised to be.
In 2021, IBM developed chips that use phase-change memory (PCM), where an electrical pulse is applied to a special type of material whose conductance changes as electricity passes through it. The material switches between amorphous and crystalline phases: a lower electrical pulse makes the device more crystalline, providing less resistance, while a higher pulse makes it more amorphous, resulting in more resistance. Instead of recording these states as the 0s or 1s you would see in digital systems, the PCM device records its state as a continuum of values between the amorphous and crystalline extremes. This value is called a synaptic weight, and it is stored in the physical atomic configuration of each PCM device. These weights are retained even when the power supply is switched off.
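Conceptually, storing a synaptic weight in a PCM cell amounts to mapping a number onto that continuum of conductance values. The sketch below illustrates the idea; the conductance range is a hypothetical placeholder, not a spec of IBM’s devices.

```python
# Conceptual sketch: a weight stored as a conductance between the two phases.
G_AMORPHOUS = 0.1e-6   # hypothetical low conductance (high resistance), in siemens
G_CRYSTALLINE = 10e-6  # hypothetical high conductance (low resistance), in siemens

def weight_to_conductance(w: float) -> float:
    """Map a normalized weight in [0, 1] onto the amorphous-crystalline continuum."""
    return G_AMORPHOUS + w * (G_CRYSTALLINE - G_AMORPHOUS)

def conductance_to_weight(g: float) -> float:
    """Read the stored weight back from the device's conductance."""
    return (g - G_AMORPHOUS) / (G_CRYSTALLINE - G_AMORPHOUS)

g = weight_to_conductance(0.75)
print(conductance_to_weight(g))  # ~0.75, retained even with the power off
```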
These properties allow you to encode the weights of a neural network directly onto an analog AI chip. In recent work published in Nature, IBM researchers showed they could encode an AI model with 17 million parameters onto an analog device they created. That’s orders of magnitude smaller than a model like GPT-3, but it points to a future where models could be encoded directly onto a PCM device and then require next to no power to use once the device’s weights are set. The device this team built was also used to encode a model that listened for any one of a set of 12 spoken words, much like consumer devices listen for wake words like “Hey Siri.” In their tests, the device recognized the words faster than existing consumer software systems, and at a fraction of the cost.
IBM researchers are working on a host of other uses for analog AI. In a paper published in Nature Electronics, a team showed they could use a different, but also energy-efficient, analog chip design: a scalable mixed-signal architecture that achieves very high accuracy on the CIFAR-10 image-recognition dataset for computer vision. The chip was designed and built by IBM at the Albany NanoTech Complex, and is composed of 64 analog in-memory compute cores, each of which contains a 256-by-256 crossbar array of PCM synaptic unit cells.
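What each of those crossbars computes physically can be mimicked numerically: applying input voltages to the rows and summing the resulting currents down each column performs a matrix-vector multiply in a single analog step. The NumPy sketch below stands in for that physics with random values; it is not a model of the actual chip.

```python
import numpy as np

# Simulate one 256-by-256 PCM crossbar performing an analog matrix-vector multiply.
rng = np.random.default_rng(0)
conductances = rng.uniform(0.0, 1.0, size=(256, 256))  # stand-ins for PCM cell conductances
voltages = rng.uniform(0.0, 1.0, size=256)             # stand-ins for input activations

# Ohm's law per cell (I = G * V) plus Kirchhoff's current law along each column
column_currents = conductances.T @ voltages
print(column_currents.shape)  # (256,) -- one accumulated current per output
```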
To move analog AI from concept to reality, a few hurdles need to be overcome. These devices need to compute at roughly the same level of precision as today’s digital devices, and they need to interface with the digital systems we rely on for everything else, such as communication. The research team that built the mixed-signal analog chip sees its work as going some way toward addressing these challenges: a chip that can interact with digital devices and perform well on certain AI tasks while sipping electricity.
Analog AI devices are still very much in the research phase, but as with the rest of the AIU family, IBM Research is looking to build a vibrant ecosystem and platform around analog AI. As part of that, IBM Research open-sourced part of its analog AI simulation framework, called AIHWKit. It’s a toolkit, integrated with PyTorch, that simulates analog crossbar arrays and allows users to estimate what impact an analog device’s material imperfections might have on the accuracy of an AI model.
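For a rough sense of what that looks like in practice, here is a minimal sketch of building a single analog layer with AIHWKit inside PyTorch. The class names and module paths reflect the toolkit as we understand it and may differ between versions, so treat this as a sketch and check the project’s documentation.

```python
import torch
from aihwkit.nn import AnalogLinear
from aihwkit.simulator.configs import SingleRPUConfig, ConstantStepDevice

# A fully-connected layer whose weights live on a simulated analog crossbar tile;
# the device configuration injects the non-idealities a real analog device would have.
layer = AnalogLinear(in_features=8, out_features=4,
                     rpu_config=SingleRPUConfig(device=ConstantStepDevice()))

x = torch.rand(2, 8)  # a small batch of inputs
y = layer(x)          # forward pass through the simulated analog tile
print(y.shape)        # torch.Size([2, 4])
```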