Publication
AGU 2024
Talk
4M4EO – Massively Multi-Modal Masked Autoencoders for Earth Observation
Abstract
Earth observation generates data from diverse modalities, such as different satellite missions, a range of higher-level data products, and corresponding metadata. Deep learning has struggled to leverage such diverse multi-modal data within a single architecture due to the missing spatio-temporal alignment across modalities, the partial unavailability of individual modalities, and their vastly different modeling requirements. To mitigate these challenges, we collect a global-scale dataset of more than 9 million spatio-temporally aligned samples from the Copernicus and Landsat missions, higher-level products such as Digital Elevation Maps, Land-Use-Land-Cover, and Canopy-Height datasets, as well as metadata such as textual descriptions and OpenStreetMap. On top of this data and significant HPC compute, we train a single multi-modal foundation model inspired by models developed for the natural-image domain, such as 4M [1]. Our 4M4EO model combines modality-specific tokenizers with a joint masked latent space reconstruction that handles partial spatio-temporal unavailability. Our approach is two-stage: first, it trains a single, task-specific encoder-decoder architecture per modality; second, it combines the resulting embeddings via masked latent space reconstruction. The approach thus first learns to embed the data in individual latent spaces and then learns to align those latent spaces through masked latent space reconstruction. We present diverse experiments, such as the design of the task-specific encoder-decoders in stage 1, comparing vector quantization with finite scalar quantization as well as diffusion-based with GAN-based approaches. For the second stage, we provide a detailed comparison of late-fusion approaches, contrasting MLP-based late fusion with masked latent space reconstruction. We believe that our work is pivotal for achieving massive multi-modality for Earth observation.

[1] Mizrahi et al., "4M: Massively Multimodal Masked Modeling," Advances in Neural Information Processing Systems 36 (2024).
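To make the stage-1 tokenizer comparison concrete, below is a minimal PyTorch sketch of finite scalar quantization (FSQ) as proposed by Mentzer et al.; the helper name fsq_quantize and the (7, 7, 5, 5) level grid are illustrative assumptions, not the actual 4M4EO tokenizer. FSQ bounds each latent dimension and rounds it to a small fixed set of levels, with a straight-through estimator so the encoder still receives gradients.

```python
import torch

def fsq_quantize(z: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
    """Finite scalar quantization (FSQ): bound each latent dimension,
    then round it to one of `levels[i]` uniformly spaced values.
    A straight-through estimator keeps the encoder trainable.
    Odd level counts keep the bounding simple; the original
    formulation adds a half-level shift for even counts."""
    half = (levels - 1) / 2.0            # e.g. levels=7 -> values -3..3
    bounded = torch.tanh(z) * half       # squash into (-half, half)
    quantized = torch.round(bounded)     # snap to the integer grid
    # Forward pass uses the quantized values, backward the bounded ones.
    return bounded + (quantized - bounded).detach()

# Hypothetical 4-dim latent with levels (7, 7, 5, 5) -> 1225 implicit codes
levels = torch.tensor([7.0, 7.0, 5.0, 5.0])
z = torch.randn(2, 4, requires_grad=True)  # (batch, latent_dim)
codes = fsq_quantize(z, levels)
```

Unlike vector quantization, there is no learned codebook and no commitment loss; the code space is the fixed product grid over the levels, which is one reason FSQ is an attractive alternative when designing stage-1 tokenizers.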
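Likewise, the stage-2 masked latent space reconstruction can be pictured as masked token modeling over the concatenated per-modality token streams. The toy class below (all names and sizes, including MaskedLatentReconstructor, are invented for illustration and do not describe the 4M4EO architecture) shows the property the abstract relies on: modalities that are unavailable for a given sample can simply be masked out, so partial availability is handled by construction.

```python
import torch
import torch.nn as nn

class MaskedLatentReconstructor(nn.Module):
    """Toy sketch of stage-2 training: per-modality token embeddings
    are concatenated, a random subset is masked, and a shared
    transformer encoder predicts the masked tokens from the rest.
    Positional/temporal embeddings are omitted for brevity."""
    def __init__(self, vocab_size=1024, dim=256, n_modalities=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.modality_embed = nn.Embedding(n_modalities, dim)
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=4)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens, modality_ids, mask):
        # tokens: (B, T) token ids; modality_ids: (B, T); mask: (B, T) bool
        x = self.embed(tokens) + self.modality_embed(modality_ids)
        # Replace masked (or unavailable) positions with a learned token.
        x = torch.where(mask.unsqueeze(-1), self.mask_token, x)
        logits = self.head(self.encoder(x))
        # Loss only on masked positions: visible tokens from any subset
        # of modalities supervise reconstruction of the hidden ones.
        return nn.functional.cross_entropy(logits[mask], tokens[mask])
```

In this framing, MLP-based late fusion would instead concatenate per-modality embeddings and map them through an MLP without any masking objective, which is the contrast the stage-2 experiments draw.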