Foundation models are fantastic at predicting the next word. Could future events be next?
The past is often an excellent guide to the future, making historical observations key to anticipating what will happen next, whether that means gauging the likelihood of a heat wave or a heart attack, or estimating whether a company will hit its sales goal or the electrical grid will sustain peak demand.
In statistics, historical observations are known as time series data, measurements that go up and down over time, things like air temperature, heart rate, retail sales, energy use, and so on. Time series forecasting involves analyzing a string of past values to predict some future value.
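To make "analyzing a string of past values to predict some future value" concrete, here is a minimal sketch of two classic baseline forecasts. The function names and the temperature readings are made up for illustration, and these baselines are far simpler than the statistical methods or transformers discussed in this article.

```python
from statistics import mean


def moving_average_forecast(history, window=3):
    """Predict the next value as the mean of the last `window` observations."""
    if len(history) < window:
        raise ValueError("need at least `window` observations")
    return mean(history[-window:])


def seasonal_naive_forecast(history, season_length):
    """Predict the next value by repeating the value from one season ago."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of observations")
    return history[-season_length]


# Hypothetical air temperatures (degrees C) that repeat roughly every 4 steps.
temps = [12.0, 14.0, 19.0, 16.0, 13.0, 15.0, 20.0, 17.0]

ma = moving_average_forecast(temps, window=4)        # mean of the last 4 readings
sn = seasonal_naive_forecast(temps, season_length=4)  # reading from 4 steps back
print(ma, sn)
```

Baselines like these are a common yardstick: a more elaborate model is generally only worth its cost if it beats them on held-out data.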
Traditionally, statistical methods have been used to identify patterns in the data that can be projected forward to make better decisions. The longer the time series, the better the forecast. But the farther back you go in time, the more complex these calculations get, especially if your target variable is influenced by other factors. Air conditioner sales, for example, may be tied to the economy and the price of electricity in addition to how warm the weather feels outside. The more interacting variables there are in any time series data set, the harder it can be to isolate the signal that foreshadows the future.
Transformers can improve the skill of traditional forecasting by extracting patterns over longer time periods, and among related variables, to predict what lies ahead. Just as large language models (LLMs) compute relationships among words to “learn” the structure of language, time series transformers look for patterns in historical observations to “understand” a dynamic process. In both cases, these abstract representations are what allow the models to solve predictive tasks.
But measurements pose complications for transformers in ways that words and pixels do not. There are three main barriers to adapting transformers to time series data.
Time series data is often continuous: Think of video streaming from a self-driving car or heart rate data from a smart watch. Not only is there tons of data to process, but its sequential order must be strictly preserved.
Time series data is also staggeringly varied: from stock prices and satellite images to brain waves and light curves from distant stars. The data crosses virtually every domain; time pervades everything. The lack of a unifying theme makes compressing disparate observations into an abstract representation an enormous challenge.
Lastly, different sets of time series data are often highly correlated. In the real world, complex events arise from multiple factors. Air temperature, pressure, and humidity, for example, strongly interact with each other to drive the weather. To predict a hurricane, you must know how these variables influenced each other in the past to understand how the future could play out. As the number of variables increases, the computations quickly become overwhelming, especially if it’s a long historical record.
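The third barrier can be illustrated with a small sketch: measuring how strongly each pair of variables moves together, and counting how quickly the number of pairs grows. The weather readings below are invented for illustration, and plain Pearson correlation is only a stand-in for the much richer cross-variable relationships a forecasting model has to capture.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(series):
    """Correlation for every unordered pair of named series."""
    return {
        (a, b): pearson(series[a], series[b])
        for a, b in combinations(sorted(series), 2)
    }


# Hypothetical readings taken at the same five time steps.
weather = {
    "temperature": [20.0, 22.0, 25.0, 27.0, 30.0],
    "pressure": [1015.0, 1013.0, 1010.0, 1008.0, 1005.0],
    "humidity": [40.0, 45.0, 50.0, 60.0, 65.0],
}

corr = pairwise_correlations(weather)
for pair, value in corr.items():
    print(pair, round(value, 3))
```

With n variables there are n(n-1)/2 pairs to track, so 3 variables give 3 pairs while 10 variables give 45, and that is before accounting for lags or interactions among three or more variables.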
The technical challenges are still being worked on, but transformers are already showing promise over traditional forecasting methods. No matter what business you’re in, the prospect of more accurate forecasts holds tremendous value.
“Time series forecasting is the bread and butter of what drives an enterprise,” said IBM’s Jayant Kalagnanam. “Every organization has a forecast of sales, demand, revenue, and capacity requirements. Accurate and reliable forecasts could save billions of dollars.”
IBM Research has been at the forefront of bringing time series forecasting into the foundation model era. At the KDD conference in 2021, IBM researchers were the first to introduce a transformer that could process raw time series data with multiple variables. Their eponymous Time Series Transformer (TST) touched off what IBM’s Vijay Ekambaram has called the “battle of time series transformers,” as transformers with names like Informer, Autoformer, and Fedformer followed IBM’s lead.
IBM has continued to push the field forward with its PatchTST and PatchTSMixer models, unveiled in 2023 at the International Conference on Learning Representations (ICLR) and KDD conferences, respectively. In experiments, both models outperformed other transformers on weather, traffic, and electrical load forecasting tasks by up to 60%. Researchers have made their code public on the open-source AI platform, Hugging Face, for others to build their own time series forecasting models with their own data. IBM will use feedback from users to improve its PatchTST and PatchTSMixer models.
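The "Patch" in these model names refers, as we understand it, to cutting a long series into short segments that are then handled as units, much as a language model handles tokens. The sketch below shows only that segmentation step. The function name, patch length, and stride are illustrative assumptions, not the models' actual settings or the Hugging Face API.

```python
def make_patches(series, patch_length, stride):
    """Split a series into fixed-length windows that may overlap.

    Trailing values that do not fill a complete patch are dropped.
    """
    if patch_length <= 0 or stride <= 0:
        raise ValueError("patch_length and stride must be positive")
    last_start = len(series) - patch_length
    return [
        series[start:start + patch_length]
        for start in range(0, last_start + 1, stride)
    ]


# Ten hypothetical measurements, cut into patches of 4 with a stride of 2.
measurements = list(range(10))
patches = make_patches(measurements, patch_length=4, stride=2)
for patch in patches:
    print(patch)
```

Grouping values this way shortens the sequence the model has to attend over: ten raw time steps become four patches here, while their order in time is preserved.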
On a separate track, IBM is also adapting pre-trained LLMs for time series forecasting. The typical LLM has absorbed the underlying structure of language by training on internet-scale data. Time series data is far less abundant, limited to about 500,000 public datasets compiled by researchers at Australia’s Monash University. “There’s no Wikipedia for time series data,” said IBM’s Nam Nguyen.
Reprogramming LLMs for time series forecasting is one way to stretch this meager resource further, said IBM’s Pin-Yu Chen. In upcoming papers at AAAI and ICLR, Chen and colleagues show how to repurpose an LLM by applying a set of reprogramming principles and two additional layers of artificial neurons trained on historical observations. In this way, a text classifier can be transformed into a tool for analyzing electricity use or designing antibody proteins.
But because language models are typically large and handle problems with just one variable, they can be slow and limiting. To add speed and flexibility, Kalagnanam and colleagues have devised a faster foundation model for multivariable forecasting, aptly named Tiny Time Mixers. At just under 1 million parameters, this new model is up to 65 times faster to tune and run. It can be trained in as little as eight hours, on limited data, and still outperform billion-parameter models on a range of few-shot forecasting tasks.
The applications for a better forecasting engine are nearly endless. IBM is currently working with clients on disparate problems, from optimizing the production of cement and granulated sugar to improving the quality of simulations tied to uncovering the nature of the universe.
IBM has found immediate business applications for transformers in their current state. But the eventual goal is to design architectures that can integrate data from anywhere into a general representation. The researchers’ vision is a transformer that can extract the structure of dynamic events in nature or the built environment and transfer this knowledge to never-before-seen domains to accurately predict the future.
“Time series data has no common dictionary of patterns across domains,” said IBM’s Ekambaram. “That’s what makes this problem so challenging — and potentially rewarding.”