Inference is the process of running live data through a trained AI model to make a prediction or solve a task.
Inference is an AI model’s moment of truth, a test of how well it can apply information learned during training to make a prediction or solve a task. Can it accurately flag incoming email as spam, transcribe a conversation, or summarize a report?
During inference, an AI model goes to work on real-time data, comparing the user’s query with information processed during training and stored in its weights, or parameters. The response that the model comes back with depends on the task, whether that’s identifying spam, converting speech to text, or distilling a long document into key takeaways. The goal of AI inference is to calculate and output an actionable result.
Training and inference can be thought of as the difference between learning and putting what you learned into practice. During training, a deep learning model computes how the examples in its training set are related, encoding these relationships in the weights that connect its artificial neurons. When prompted, the model generalizes from this stored representation to interpret new, unseen data, in the same way that people draw on prior knowledge to infer the meaning of a new word or make sense of a new situation.
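To make the distinction concrete, here is a minimal sketch of inference with a toy spam classifier. The vocabulary, weights, and bias are invented for illustration; in a real system they would be the product of training, and inference would only read them.

```python
import math

# Hypothetical weights "learned" during training (illustrative values only).
WEIGHTS = {"free": 1.8, "winner": 2.1, "meeting": -1.5, "invoice": -0.4}
BIAS = -1.0


def predict_spam_probability(text: str) -> float:
    """Inference: apply the frozen weights to unseen input; nothing is updated."""
    words = text.lower().split()
    score = BIAS + sum(WEIGHTS.get(word, 0.0) for word in words)
    return 1.0 / (1.0 + math.exp(-score))  # logistic function maps score to (0, 1)


if __name__ == "__main__":
    for email in ["free winner prize", "meeting moved to noon"]:
        print(email, "->", round(predict_spam_probability(email), 3))
```

Training would adjust `WEIGHTS` and `BIAS` over many examples. Inference just evaluates the function, which is why it can be repeated millions of times a day.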
The artificial neurons in a deep learning model are inspired by neurons in the brain, but they’re nowhere near as efficient. Training just one of today’s generative models can cost millions of dollars in computer processing time. But as expensive as training an AI model can be, it’s dwarfed by the expense of inferencing. Each time someone runs an AI model on their computer, or on a mobile phone at the edge, there’s a cost — in kilowatt hours, dollars, and carbon emissions.
Because up to 90% of an AI model’s life is spent in inference mode, the bulk of AI’s carbon footprint is also here, in serving AI models to the world. By some estimates, running a large AI model puts more carbon into the atmosphere over its lifetime than the average American car.
“Training the model is a one-time investment in compute while inferencing is ongoing,” said Raghu Ganti, an expert on foundation models at IBM Research. “An enterprise might have millions of visitors a day using a chatbot powered by Watson Assistant. That’s a tremendous amount of traffic.”
All that traffic and inferencing is not only expensive but can also lead to frustrating slowdowns for users. As a result, IBM and other tech companies have been investing in technologies that speed up inferencing, both to provide a better user experience and to bring down AI’s operational costs.
How fast an AI model runs depends on the stack. Improvements made at each layer — hardware, software, and middleware — can speed up inferencing on their own and together.
Developing more powerful computer chips is an obvious way to boost performance. One area of focus for IBM Research has been to design chips optimized for matrix multiplication, the mathematical operation that dominates deep learning.
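A rough sketch of why matrix multiplication dominates: a single dense layer is one matrix product, and its cost grows with the product of its dimensions. The helper names and the 4096-wide layer below are illustrative assumptions, not figures from IBM.

```python
def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix, both given as nested lists."""
    k = len(b)
    assert all(len(row) == k for row in a), "inner dimensions must match"
    n = len(b[0])
    return [[sum(row[i] * b[i][j] for i in range(k)) for j in range(n)] for row in a]


def matmul_flops(m, k, n):
    """Approximate operation count: each of the m*n outputs needs about k multiplies and k adds."""
    return 2 * m * k * n


if __name__ == "__main__":
    # One input vector (1 x 2) through a layer with a 2 x 3 weight matrix.
    print(matmul([[1, 2]], [[1, 0, 2], [0, 1, 3]]))
    # One token through a hypothetical 4096 x 4096 layer.
    print(matmul_flops(1, 4096, 4096))
```

Even one token through one layer of that hypothetical size costs tens of millions of operations, which is the work an accelerator is built to do in parallel.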
Telum, IBM’s first commercial accelerator chip for AI inferencing, is an example of hardware optimized for this type of math. So are IBM’s prototype Artificial Intelligence Unit (AIU) and its work on analog AI chips.
Another way of getting AI models to run faster is to shrink the models themselves. Pruning excess weights and reducing the model’s precision through quantization are two popular methods for designing more efficient models that perform better at inference time.
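Both methods can be sketched in a few lines. The example below assumes magnitude-based pruning and symmetric 8-bit quantization with a single scale per tensor. These are common textbook variants chosen for illustration, not necessarily the ones IBM uses.

```python
def prune_by_magnitude(weights, fraction):
    """Zero out the smallest-magnitude `fraction` of the weights (rounded down)."""
    n_prune = int(len(weights) * fraction)
    # Indices of the n_prune weights closest to zero.
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:n_prune]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in q_weights]


if __name__ == "__main__":
    weights = [0.02, -1.27, 0.5, -0.01]  # illustrative values
    pruned = prune_by_magnitude(weights, 0.5)
    q, scale = quantize_int8(pruned)
    print(pruned, q, dequantize(q, scale))
```

Each quantized weight occupies one byte instead of two or four, at the cost of a rounding error of at most half of `scale` per weight.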
A third way to accelerate inferencing is to remove bottlenecks in the middleware that translates AI models into operations that various hardware backends can execute to solve an AI task. To achieve this, IBM has collaborated with developers in the open-source PyTorch community.
Part of the Linux Foundation, PyTorch is a machine-learning framework that ties together software and hardware to let users run AI workloads in the hybrid cloud. One of PyTorch’s key advantages is that it can run AI models on any hardware backend: GPUs, TPUs, IBM AIUs, and traditional CPUs. This universal framework, accessed via Red Hat OpenShift, gives enterprises the option of keeping sensitive AI workloads on-premises while running other workloads on public and private servers in the hybrid cloud.
Middleware may be the least glamorous layer of the stack, but it’s essential for solving AI tasks. At runtime, the compiler in this middle layer transforms the AI model’s high-level code into a computational graph that represents the mathematical operations for making a prediction. The GPUs and CPUs in the backend carry out these operations to output a solution.
Serving large deep learning models involves a ton of matrix multiplication. For this reason, cutting even small amounts of unnecessary computation can lead to big performance gains. In the last year, IBM Research worked with the PyTorch community and adopted two key improvements in PyTorch. PyTorch Compile supports automatic graph fusion, which reduces the number of nodes in the computational graph and thus the number of round trips between a CPU and a GPU. PyTorch Accelerated Transformers support kernel optimization that streamlines attention computation by optimizing memory accesses, which remain the primary bottleneck for large generative models.
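The idea behind graph fusion can be illustrated with a toy example. It assumes a graph that is just a chain of elementwise operations, where each node stands in for one separate pass over the data. The `Node`, `run`, and `fuse` names are hypothetical and do not reflect how the PyTorch compiler is implemented.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    name: str
    fn: Callable[[float], float]  # elementwise op applied to each value


def run(graph: List[Node], data: List[float]) -> List[float]:
    """Execute nodes in order; each node makes one full pass over the data."""
    for node in graph:
        data = [node.fn(x) for x in data]
    return data


def fuse(graph: List[Node]) -> List[Node]:
    """Collapse a chain of elementwise nodes into a single node that does all the work."""
    def fused_fn(x: float) -> float:
        for node in graph:
            x = node.fn(x)
        return x

    return [Node("+".join(n.name for n in graph), fused_fn)]


graph = [
    Node("scale", lambda x: x * 2.0),
    Node("shift", lambda x: x + 1.0),
    Node("relu", lambda x: max(x, 0.0)),
]
fused = fuse(graph)

if __name__ == "__main__":
    print(len(graph), "nodes ->", len(fused), "node:", fused[0].name)
    print(run(fused, [-2.0, 0.5, 3.0]))
```

The fused graph produces the same values with one pass instead of three. In PyTorch 2.x the real entry point is `torch.compile(model)`, which performs this kind of optimization automatically.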
Recently, IBM Research added a third improvement to the mix: parallel tensors. The biggest bottleneck in AI inferencing is memory. Running a 70-billion-parameter model requires at least 150 gigabytes of memory, nearly twice as much as an Nvidia A100 GPU holds. But if the compiler can split the AI model’s computational graph into strategic chunks, those operations can be spread across GPUs and run at the same time.
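A back-of-the-envelope sketch shows where the memory figure comes from and what splitting the work looks like. It assumes 16-bit weights (2 bytes each) and the 80-gigabyte A100 variant. The gap between the weights alone and 150 gigabytes is presumably runtime overhead, which is not modeled here. The column-wise split is one simple way to shard a matrix product, shown with plain lists rather than real GPUs.

```python
import math


def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def split_columns(matrix, n_shards):
    """Split a weight matrix column-wise into at most n_shards roughly equal pieces."""
    n_cols = len(matrix[0])
    step = math.ceil(n_cols / n_shards)
    return [[row[s:s + step] for row in matrix] for s in range(0, n_cols, step)]


def matvec(x, matrix):
    """Compute x @ matrix for a vector x and a matrix stored as a list of rows."""
    return [sum(xi * row[j] for xi, row in zip(x, matrix)) for j in range(len(matrix[0]))]


def parallel_matvec(x, matrix, n_shards):
    """Each shard would live on its own GPU; here the shards simply run in a loop."""
    out = []
    for shard in split_columns(matrix, n_shards):
        out.extend(matvec(x, shard))  # concatenate the partial outputs
    return out


if __name__ == "__main__":
    print(weight_memory_gb(70e9, 2), "GB for 16-bit weights alone")
    print(math.ceil(150 / 80), "GPUs needed at 80 GB each (assumed capacity)")
    w = [[1, 2, 3, 4], [5, 6, 7, 8]]
    print(parallel_matvec([1, 1], w, n_shards=2))
```

Because each shard only needs its own slice of the weights, no single device has to hold the full model, and the partial products can be computed at the same time.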
Inferencing speed is measured as latency, the time it takes for an AI model to generate a token (a word or part of a word) when prompted. When IBM Research tested its three-lever solution (graph fusion, kernel optimization, and parallel tensors) on a 70-billion-parameter Llama2 model, researchers achieved a latency of 29 milliseconds per token at 16-bit inferencing. The solution will represent a 20% improvement over the current industry standard once it’s made operational.
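Per-token latency translates directly into generation speed for a single request. The sketch below converts the reported figure into tokens per second and shows one simple way to time a token generator. The `generate_token` callable is a hypothetical stand-in for a real model's decoding step.

```python
import time


def tokens_per_second(latency_ms_per_token: float) -> float:
    """Convert per-token latency into single-stream generation speed."""
    return 1000.0 / latency_ms_per_token


def measure_latency_ms(generate_token, n_tokens: int) -> float:
    """Average wall-clock milliseconds per token for a token-generating callable."""
    start = time.perf_counter()
    for _ in range(n_tokens):
        generate_token()
    return (time.perf_counter() - start) * 1000.0 / n_tokens


if __name__ == "__main__":
    print(round(tokens_per_second(29), 1), "tokens per second at 29 ms per token")
    print(measure_latency_ms(lambda: None, 1000), "ms per call for a no-op stand-in")
```

At 29 milliseconds per token, a single request produces roughly 34 tokens each second.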
Each of these techniques had been used before to improve inferencing speeds, but this is the first time all three have been combined. IBM researchers had to figure out how to get the techniques to work together without cannibalizing the others’ contributions. “It’s like three people fighting with each other and only two are friends,” said Mudhakar Srivatsa, an expert on inference optimization at IBM Research.
To further boost inferencing speeds, IBM and PyTorch plan to add two more levers to the PyTorch runtime and compiler for increased throughput. The first, dynamic batching, allows the runtime to consolidate multiple user requests into a single batch so each GPU can operate at full capacity. The second, quantization, allows the compiler to run the computational graph at lower precision to reduce its load on memory without losing accuracy. Join IBM researchers for a deep dive on this and more at the 2023 PyTorch Conference Oct. 16-17 in San Francisco.
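Dynamic batching can be sketched as a queue that is drained in groups. The version below is a simplified illustration that assumes all requests are already waiting. A production runtime would also cap how long a request may wait for a batch to fill. `run_model` is a hypothetical stand-in for one forward pass over a whole batch.

```python
from collections import deque


def dynamic_batches(requests, max_batch_size):
    """Group pending requests into batches of at most max_batch_size, in arrival order."""
    queue = deque(requests)
    while queue:
        batch_size = min(max_batch_size, len(queue))
        yield [queue.popleft() for _ in range(batch_size)]


def run_model(batch):
    """Stand-in for a single forward pass that handles every prompt in the batch."""
    return [prompt.upper() for prompt in batch]


def serve(requests, max_batch_size):
    """Return all results plus the number of forward passes that were needed."""
    results, n_forward_passes = [], 0
    for batch in dynamic_batches(requests, max_batch_size):
        results.extend(run_model(batch))
        n_forward_passes += 1
    return results, n_forward_passes


if __name__ == "__main__":
    print(serve(["a", "b", "c", "d", "e"], max_batch_size=2))
```

Five requests with a batch size of two need three forward passes rather than five, which is the kind of consolidation that keeps each GPU busy.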