Publication
AGU 2024
Talk
MS-CLIP: Multi-spectral Vision Language Learning for Earth Observation
Abstract
Recent Vision-Language Models (VLMs) have enabled a wide range of new tasks in the general vision domain, such as zero-shot classification and cross-modal retrieval. However, existing VLMs are limited to RGB data and do not leverage the full potential of multi-spectral satellite data. We used continual pre-training of the CLIP model [1] to create a first-of-its-kind VLM that can process multi-spectral data, focusing on low-resolution satellite imagery from Sentinel-2. Our model, MS-CLIP, employs the dual-encoder architecture of CLIP with a patch embedding adapted for multi-spectral input. The model is trained with a contrastive objective that minimizes the distance between matching vision and text embeddings. Additionally, this work includes building a large-scale image-caption dataset with 900k multi-spectral samples from the SSL4EO-S12 dataset [2]. We developed a captioning pipeline using LLaMA3-LLaVA-NeXT [3] to automatically generate captions based on the RGB channels and Overture Maps base-layer tags. A subset of the captions was assessed by domain experts to validate our synthetic data generation. Trained on this large-scale dataset, MS-CLIP demonstrates state-of-the-art performance on zero-shot EO tasks. The ViT-B/16 model reaches a zero-shot classification accuracy of 63% on EuroSAT, outperforming vanilla CLIP by over 10 pp. Text-to-image retrieval performance increased by 14 pp. to 61% mAP@100. We plan to open-source the dataset and model weights in the future. The model can be used to build multi-spectral zero-shot segmentation models and multi-modal LLMs for Earth Observation, which interpret satellite images beyond the visual spectrum and enable innovative applications.

References

[1] Radford, A. et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning.
[2] Wang, Y. et al. (2023). SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine.
[3] Li, B. et al. (2024). LLaVA-NeXT: Stronger LLMs supercharge multimodal capabilities in the wild.
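To make the continual pre-training setup described in the abstract concrete, the following is a minimal sketch of its two ingredients: a patch embedding expanded from 3 RGB channels to multi-spectral input (here assumed to be the 13 Sentinel-2 bands, reusing the pretrained RGB kernels), and a CLIP-style symmetric contrastive loss. This is an illustration only, not the released MS-CLIP code; the channel count, initialization scheme, and temperature value are assumptions.

```python
# Sketch: multi-spectral patch embedding + CLIP-style contrastive loss (PyTorch).
# Not the authors' implementation; channel count and init strategy are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


def expand_patch_embed(rgb_conv: nn.Conv2d, in_channels: int = 13) -> nn.Conv2d:
    """Build a patch-embedding conv for `in_channels` bands, reusing the
    pretrained RGB kernels (a common continual-pre-training trick)."""
    ms_conv = nn.Conv2d(in_channels, rgb_conv.out_channels,
                        kernel_size=rgb_conv.kernel_size,
                        stride=rgb_conv.stride,
                        bias=rgb_conv.bias is not None)
    with torch.no_grad():
        # Initialize every band with the mean of the RGB kernels, then copy the
        # original RGB weights into the first three channel positions.
        mean_w = rgb_conv.weight.mean(dim=1, keepdim=True)      # (D, 1, P, P)
        ms_conv.weight.copy_(mean_w.repeat(1, in_channels, 1, 1))
        ms_conv.weight[:, :3].copy_(rgb_conv.weight)
    return ms_conv


def clip_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss pulling matching image/text embeddings together."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                # (B, B) similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Example: a stand-in RGB patch embed of a ViT-B/16 (patch size 16, width 768).
rgb_conv = nn.Conv2d(3, 768, kernel_size=16, stride=16, bias=False)
ms_conv = expand_patch_embed(rgb_conv, in_channels=13)
patches = ms_conv(torch.randn(2, 13, 224, 224))                 # (2, 768, 14, 14)
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```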
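Once trained, zero-shot classification works as in CLIP: the multi-spectral image embedding is scored against text embeddings of class-name prompts. The sketch below is likewise an illustration rather than the released API; the class names, prompt template, and placeholder embeddings are assumptions.

```python
# Sketch: zero-shot classification via prompt scoring (placeholder embeddings).
import torch
import torch.nn.functional as F

class_names = ["annual crop", "forest", "river", "industrial area"]  # EuroSAT-style labels
prompts = [f"a satellite image of {c}" for c in class_names]

# In practice these come from the trained vision encoder (multi-spectral input)
# and text encoder; random tensors stand in here.
image_embedding = F.normalize(torch.randn(1, 512), dim=-1)
text_embeddings = F.normalize(torch.randn(len(prompts), 512), dim=-1)

logits = image_embedding @ text_embeddings.t()      # cosine similarities
predicted = class_names[logits.argmax(dim=-1).item()]
print(predicted)
```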