For the most part, model weights are stationary, and AI computing is memory-centric rather than compute-heavy, said Le Gallo-Bourdeau. “You have a fixed set of synaptic weights, and you just need to propagate data through them.”
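To make that point concrete, here is a minimal sketch (in Python with NumPy, not IBM’s code) of why inference is memory-centric: the weight matrices stay fixed while a stream of inputs is propagated through them, so on a von Neumann machine the same weights are fetched across the memory bus for every request.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer network with fixed, pre-trained weights.
W1 = rng.standard_normal((512, 256))
W2 = rng.standard_normal((256, 10))

def forward(x):
    """Propagate one input through the fixed weights."""
    h = np.maximum(W1.T @ x, 0.0)  # ReLU hidden layer
    return W2.T @ h                # output logits

# The weights never change; only the activations do. On a von Neumann
# machine, the same weights cross the memory bus for every input.
for _ in range(1000):
    y = forward(rng.standard_normal(512))
```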
This quality has enabled him and his colleagues to pursue analog in-memory computing, which integrates memory with processing and uses the laws of physics to store weights. One such approach is phase-change memory (PCM), which stores model weights in the resistivity of a chalcogenide glass; applying an electrical current changes that resistivity.
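The toy numerical sketch below, which assumes idealized devices and a made-up maximum conductance, illustrates the principle behind analog in-memory matrix-vector multiplication: weights become conductances in a crossbar, Ohm’s law performs the multiplications, and Kirchhoff’s current law performs the sums.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))   # trained weights (rows x columns)

# Differential encoding: one positive and one negative conductance per
# weight, since physical conductances cannot be negative.
g_max = 25e-6                     # assumed maximum conductance, in siemens
scale = g_max / np.abs(W).max()
G_pos = np.clip(W, 0, None) * scale
G_neg = np.clip(-W, 0, None) * scale

x = rng.standard_normal(8)
v = 0.2 * x                       # encode inputs as voltages (assumed 0.2 V full scale)

# Each crossbar column sums the currents of its devices (Kirchhoff's
# current law), and each device contributes I = G * V (Ohm's law).
i_out = G_pos @ v - G_neg @ v

# Undo the conductance and voltage encodings to recover the digital result.
y = i_out / (scale * 0.2)
print(np.allclose(y, W @ x))      # True, up to floating-point error
```

The positive/negative split is one common way to represent signed weights with conductances, which are physically non-negative.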
“This way we can reduce the energy that is spent in data transfers and mitigate the von Neumann bottleneck,” said Le Gallo-Bourdeau. In-memory computing isn’t the only way to work around the von Neumann bottleneck, though.
The AIU NorthPole is a processor whose memory is digital SRAM, and while its memory isn’t intertwined with compute in the same way as analog chips, its numerous cores each have access to local memory, making it an extreme example of near-memory computing. Experiments have already demonstrated the power and promise of this architecture. In recent inference tests run on a 3-billion-parameter LLM developed from IBM’s Granite-8B-Code-Base model, NorthPole was 47 times faster than the next most energy-efficient GPU and 73 times more energy efficient than the next lowest-latency GPU.
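As a loose illustration only, and not a description of NorthPole’s actual design, this sketch shows the idea behind near-memory computing: a weight matrix is partitioned so that each core holds its own tile in local memory, and only activations and partial results travel between cores.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128))
x = rng.standard_normal(128)

NUM_CORES = 4  # hypothetical core count, for illustration only
row_tiles = np.array_split(np.arange(256), NUM_CORES)

class Core:
    """A toy 'core' that keeps its slice of the weights in local memory."""
    def __init__(self, rows):
        self.local_weights = W[rows]      # held locally, never moved again

    def compute(self, activations):
        # Only the activation vector is broadcast; weights stay put.
        return self.local_weights @ activations

cores = [Core(rows) for rows in row_tiles]
y = np.concatenate([core.compute(x) for core in cores])
print(np.allclose(y, W @ x))              # True
```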
It’s also important to note that models trained on von Neumann hardware can be run on non-von Neumann devices. In fact, for analog in-memory computing, it’s essential: PCM devices aren’t durable enough to have their weights rewritten over and over, so they’re used to deploy models that have already been trained on conventional GPUs. Durability is a comparative advantage of SRAM in near-memory or in-memory computing, as it can be rewritten essentially without limit.
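Here is a toy sketch of that deployment flow, with an invented Gaussian programming-error term standing in for real device behavior: the GPU-trained weights are written to the analog array once, and inference afterward only reads them.

```python
import numpy as np

rng = np.random.default_rng(0)

W_trained = rng.standard_normal((64, 64))   # trained on conventional hardware

def program_once(weights, write_noise=0.02):
    """One-time transfer to the analog array; no further weight updates."""
    # The noise level here is made up, purely to suggest imperfect programming.
    return weights + write_noise * rng.standard_normal(weights.shape)

W_deployed = program_once(W_trained)

# Inference only reads the programmed weights; the endurance-limited
# write operation is never repeated.
x = rng.standard_normal(64)
y = W_deployed @ x
```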
While von Neumann architecture creates a bottleneck for AI computing, it’s perfectly suited to other applications. It causes issues in model training and inference, but it excels at computer graphics and other compute-heavy workloads. And when 32- or 64-bit floating-point precision is called for, the low precision of in-memory computing isn’t up to the task.
“For general purpose computing, there's really nothing more powerful than the von Neumann architecture,” said Burr. In this scheme, bytes are either operations or operands moving on a bus from memory to a processor. “Just like an all-purpose deli where somebody might order some salami or pepperoni or this or that, but you're able to switch between them because you have the right ingredients on hand, and you can easily make six sandwiches in a row.” Special-purpose computing, on the other hand, may mean 5,000 tuna sandwiches for one order, much like AI computing as it shuttles static model weights.
Even when building their in-memory AIU chips, IBM researchers include some conventional hardware for the necessary high-precision operations.
Even as scientists and engineers work on new ways to eliminate the von Neumann bottleneck, experts agree that the future will likely include both hardware architectures, said Le Gallo-Bourdeau. “What makes sense is some mix of von Neumann and non-von Neumann processors to each handle the operations they are best at.”