The goal is tunable, reusable foundation models that make it easier to mine vast datasets for new knowledge to advance science and help us adapt to a changing environment.
Climate change is creating more of the heat and drought that fuel wildfires. How will the smoke impact air quality? How will the searing heatwaves of the past year impact corn and wheat yields?
These are just some of the questions that scientists would like to answer by combing through millions of earth science papers and mining mountains of satellite images.
In a new collaboration, NASA and IBM are creating AI foundation models to analyze petabytes of text and remote-sensing data to make it easier to build AI applications tailored to specific questions and tasks.
“It won’t just be NASA that benefits, other agencies and organizations will too,” said Rahul Ramachandran, a senior research scientist at NASA’s Marshall Space Flight Center. “We hope that these models will make information and knowledge more accessible to everyone and encourage people to build applications that make it easier to use our datasets to make discoveries and decisions based on the latest science.”
Foundation models ingest massive amounts of raw data and, with no explicit instruction, find its underlying structure. Pre-train a foundation model, and you can teach it an entirely new task with a limited set of hand-labeled examples. Traditionally, one of the main bottlenecks to applying machine learning to remote sensing data was a shortage of training examples — things like trees and crops in satellite data segmented and labeled by human experts so the computer knows what features to focus on.
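The pretrain-then-fine-tune recipe described above can be sketched in a few lines. Everything here is illustrative, not NASA or IBM code: the frozen "backbone" is a stand-in random projection rather than a real foundation model, and the handful of hand-labeled examples are synthetic. Only a small classification head is trained on top of the frozen features.

```python
# Toy sketch of few-shot fine-tuning: freeze a "pretrained" encoder,
# train only a small head on a limited set of labeled examples.
import numpy as np

rng = np.random.default_rng(0)

def pretrained_encoder(x):
    # Stand-in for a frozen foundation-model backbone: a fixed,
    # deterministic nonlinear projection (not a real trained model).
    W = np.random.default_rng(42).normal(size=(x.shape[-1], 8))
    return np.tanh(x @ W)

# A "limited set of hand-labeled examples" (20 toy samples, 2 classes).
X = rng.normal(size=(20, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Fine-tune: plain gradient descent on a logistic-regression head,
# leaving the encoder untouched.
feats = pretrained_encoder(X)
w, b = np.zeros(feats.shape[1]), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(feats @ w + b)))   # predicted probabilities
    w -= 0.5 * feats.T @ (p - y) / len(y)    # gradient step on weights
    b -= 0.5 * (p - y).mean()                # gradient step on bias

acc = ((feats @ w + b > 0) == (y == 1)).mean()
print(f"few-shot head accuracy: {acc:.2f}")
```

The design point is that the expensive part (the encoder) is reused as-is, so only a tiny model needs labeled data.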
The recent introduction of transformer-based models could remove this roadblock. NASA is sitting on 70 petabytes of earth science data, a number expected to reach 600 petabytes by 2030 with the launch of a dozen new missions, including Surface Water and Ocean Topography (SWOT) and NISAR.
Ramachandran is hopeful that foundation models can multiply the usefulness of NASA data. It’s part of a broader NASA push to make data, code, and AI models available to everyone through its Open-Source Science Initiative.
IBM and NASA will build two foundation models. The first will be trained on reams of earth science journals to thematically organize the literature and make it easier to search and discover new knowledge. The second model will be trained on USGS and NASA’s popular dataset, Harmonized Landsat Sentinel-2 (HLS), a record of land-use changes captured by Earth-orbiting satellites. Downstream applications include detecting natural hazards and tracking changes to vegetation and wildlife habitat for natural resource management.
Much of the literature in earth science is dense and often lacks the context that would allow non-experts to dive in and quickly get up to speed on a topic. A foundation model to organize it all could vastly simplify the search and discovery process.
IBM has built a foundation model with nearly 300,000 articles from journals published by the American Geophysical Union (AGU) and the American Meteorological Society (AMS), among other scientific organizations. Researchers are now in the process of fine-tuning the language model and, with NASA’s guidance, creating earth-science-specific benchmarks to measure the model’s performance.
Once fully trained, the model will be used with PrimeQA, IBM’s open-source multilingual question-answering system. Type in a question like, “How will aerosols impact climate change?” and the model will eventually be able to summarize an answer from the most recent, cited papers, with links and any relevant context.
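Question-answering systems of this kind typically begin by retrieving the most relevant documents before composing an answer. The sketch below is illustrative only and does not use PrimeQA's actual API: the corpus is three invented snippets, and relevance here is plain word overlap, where a real system would use a learned neural retriever.

```python
# Toy retrieval step: rank a tiny corpus of (invented) paper snippets
# against a question by counting shared words.
corpus = {
    "paper-1": "aerosols scatter sunlight and can offset some greenhouse warming",
    "paper-2": "wheat yields decline under sustained heatwave conditions",
    "paper-3": "wildfire smoke degrades downwind air quality for days",
}

def retrieve(question, docs):
    q = set(question.lower().split())
    # Return the document sharing the most words with the question.
    return max(docs, key=lambda d: len(q & set(docs[d].split())))

print(retrieve("how will aerosols impact climate change", corpus))  # paper-1
```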
Potential users include earth and data scientists, policy makers, and the public. The goal is to unlock this hidden knowledge so more people can use it, said Ramachandran. The model could help users find relevant research and datasets in their area of interest. It could also help to identify topics of emerging interest, allowing NASA and other science agencies to prioritize resources to fill those knowledge gaps.
The foundation model is built on a portable middleware stack to make it easier to scale and iteratively improve on. It’s one of the largest AI workloads trained on Red Hat’s OpenShift software to date, said Tushar Katarki, who heads product management for OpenShift.
“Training a large language model involves multiple machines working together,” said IBM researcher Bishwaranjan Bhattacharjee. “Building the AI training platform on OpenShift lets you train the model anywhere without being tied to a specific public or private cloud.”
In a second project underway, IBM is building a foundation model to make it easier to develop AI applications that analyze satellite data at scale. IBM is starting with the HLS dataset and, if successful, will tackle MERRA-2, a dataset that combines aerosol observations from space with modeling of Earth’s climate system. That work could improve applications for weather and climate prediction.
“The beauty of foundation models is they can potentially be used for many downstream applications,” said Ramachandran. “You need just a few samples to tune the model to make predictions as accurately as one built with large amounts of manually labeled data.”
Scientists use HLS to understand and predict where forests are being cut down, cities are expanding, and crop yields are set to increase or decline. Practical applications include tracking deforestation, coordinating responses to natural disasters, monitoring mining sites, and tracking invasive species.
Most foundation models until now have applied transformers to sequences of words. But researchers hope that transformers can learn the structure of images just as efficiently. “Our results are promising so far,” said IBM researcher Raghu Ganti. “If we can build a foundation model for HLS, there are dozens of other remote-sensing datasets that could benefit, paving the way for new applications and discoveries.”
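The standard way to hand images to a transformer (the vision-transformer idea) is to cut each image into fixed-size patches, so that each flattened patch plays the role of one "word" in the sequence. The sketch below shows only that tokenization step; the tile size and patch size are illustrative, not those of the actual HLS model.

```python
# Turn an image into a sequence of patch "tokens", as a vision
# transformer would, using a reshape/transpose trick.
import numpy as np

def image_to_patch_tokens(img, patch=4):
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    # Split height and width into (grid, patch) axes, bring the two
    # grid axes together, then flatten each patch into one vector.
    tokens = (img.reshape(h // patch, patch, w // patch, patch, c)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(-1, patch * patch * c))
    return tokens  # shape: (num_patches, patch_dim)

img = np.zeros((16, 16, 3))           # stand-in for one satellite tile
tokens = image_to_patch_tokens(img)
print(tokens.shape)                    # (16, 48)
```

A 16x16 image with 4x4 patches yields a sequence of 16 tokens, each of dimension 4*4*3 = 48, which the transformer then attends over exactly as it would a sentence.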
This work comes as NASA prepares to launch its Year of Open Science, which will feature events throughout 2023 to promote data and AI-model sharing to accelerate scientific discovery. Foundation models offer a potentially huge leap for the open-science movement, said Ramachandran.
“These large problems cannot be tackled by small teams,” he added. “You need teams across different organizations to bring their different perspectives, resources, and skill sets.”