4M4EO – Massively Multi-Modal Masked Autoencoders for Earth Observation
Abstract
Earth observation generates data from diverse modalities, such as different satellite missions, a range of higher-level data products, and corresponding metadata. Deep learning has struggled to leverage such diverse multi-modal data within a single architecture due to the missing spatio-temporal alignment of the modalities, the partial unavailability of certain modalities, and the vastly different modeling requirements of each modality. To mitigate these challenges, we collect a global-scale dataset of more than 9 million spatio-temporally aligned samples from the Copernicus and Landsat missions, higher-level products such as Digital Elevation Maps, Land-Use/Land-Cover, and Canopy-Height datasets, as well as metadata such as textual descriptions and OpenStreetMap. On top of this data and significant HPC compute, we train a single multi-modal foundation model inspired by models developed for the natural image domain, such as 4M [1]. Our 4M4EO model combines modality-specific tokenizers with a joint masked latent space reconstruction that handles partial spatio-temporal unavailability. Our approach is two-staged: first, it trains a single, task-specific encoder-decoder architecture per modality; second, it combines the resulting embeddings via masked latent space reconstruction. The approach thus first learns to embed the data in individual latent spaces and then learns to align those latent spaces through masked latent space reconstruction. We present diverse experiments, such as the design of the task-specific encoder-decoders in stage 1, comparing vector quantization with finite scalar quantization as well as diffusion-based with GAN-based approaches. For the second stage, we provide a detailed comparison of late fusion approaches, contrasting MLP-based late fusion with masked latent space reconstruction. We believe that our work is pivotal for achieving massive multi-modality for Earth Observation.

[1] Mizrahi et al., "4M: Massively Multimodal Masked Modeling," Advances in Neural Information Processing Systems 36 (2024).
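To make the stage-1 design space concrete, the following is a minimal sketch of finite scalar quantization (FSQ), one of the two quantization schemes the abstract compares for the modality-specific tokenizers. It is not the authors' implementation; the function name `fsq_quantize`, the per-dimension `levels`, and the use of odd level counts are illustrative assumptions (even level counts require an extra half-step offset, as described in the FSQ literature).

```python
import torch

def fsq_quantize(z: torch.Tensor, levels: list[int]) -> torch.Tensor:
    """Illustrative Finite Scalar Quantization: bound each latent dimension
    with tanh, then round it onto a fixed grid of `levels[i]` values.
    A straight-through estimator keeps gradients flowing through the
    non-differentiable rounding step. Assumes odd level counts so that
    rounding yields exactly `levels[i]` distinct values per dimension."""
    lv = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (lv - 1) / 2                      # grid half-width per dimension
    bounded = torch.tanh(z) * half           # map each dim into (-half, half)
    quantized = torch.round(bounded)         # snap to the integer grid
    # straight-through: forward pass uses `quantized`, backward uses `bounded`
    return bounded + (quantized - bounded).detach()

# Usage: last latent dimension must match len(levels); 7*5*5*5 = 875 codes.
z = torch.randn(4, 16, 16, 4, requires_grad=True)
q = fsq_quantize(z, levels=[7, 5, 5, 5])
```

Unlike vector quantization, FSQ needs no learned codebook and no commitment loss, which is one reason the two schemes are worth comparing for per-modality tokenizers.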
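For stage 2, the sketch below illustrates one plausible reading of masked latent space reconstruction over the stage-1 embeddings: tokens from whichever modalities are available for a sample are concatenated, a random subset is replaced by a mask token, and a Transformer is trained to recover the hidden tokens from the visible ones. All class and parameter names are hypothetical, and for brevity this version regresses continuous latents with an MSE loss, whereas a 4M-style model would predict discrete token indices with a cross-entropy head.

```python
import torch
import torch.nn as nn

class MaskedLatentReconstructor(nn.Module):
    """Illustrative cross-modal masked latent reconstruction. Modalities that
    are unavailable for a sample simply contribute no tokens, which is how
    this formulation tolerates partial spatio-temporal unavailability."""

    def __init__(self, dim=256, n_modalities=8, depth=4, heads=8):
        super().__init__()
        self.modality_emb = nn.Embedding(n_modalities, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, dim)  # predicts the original latent token

    def forward(self, tokens, modality_ids, mask_ratio=0.5):
        # tokens: (B, N, dim) stage-1 latents from the available modalities;
        # modality_ids: (B, N) source-modality index for each token.
        x = tokens + self.modality_emb(modality_ids)
        mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        pred = self.head(self.encoder(x))
        # reconstruction loss only on the masked positions
        return ((pred - tokens) ** 2)[mask].mean()
```

An MLP-based late fusion baseline, by contrast, would simply concatenate the per-modality embeddings and map them through a feed-forward network, with no masking objective to force cross-modal alignment.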