AGU Fall 2023
Conference paper

Area Sampling for Training Geospatial Foundation Models


Training unsupervised geospatial models accurately requires datasets that are both diverse and of high integrity. This study presents a sampling method that increases the statistical diversity of the selected geospatial data, so that the training set better represents the underlying geographical characteristics. We extract multiple statistics, including land use, temperature, and precipitation, from specific areas at resolutions finer than the defined tiles. We then group areas with similar statistics into distinct clusters, which makes the distribution of the data easier to characterize. To sample representatively from each cluster, we count the data points within each area and derive sampling weights from those counts. The weights down-weight high-frequency data points, so that less frequent data are favored during sampling. This strategy promotes a balanced representation across the entire dataset, which in turn is intended to improve the accuracy of the geospatial foundation model. Our results demonstrate the potential of optimized geospatial data sampling for a wide array of applications and modeling tasks, leading to improved model accuracy and broader practicality.
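
The weighting step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the weighting formula, so the sketch assumes one simple choice, a weight of 1 / (cluster size) per area, which gives every cluster the same total sampling probability. The function names, the toy area identifiers, and the cluster labels are all hypothetical, and the clustering itself is assumed to have been done already.

```python
import random
from collections import Counter


def inverse_frequency_weights(cluster_labels):
    """Return one weight per area: 1 / (size of its cluster), normalized to sum to 1.

    Assumption: inverse cluster size is only one possible way to
    down-weight frequent data; the abstract does not specify the formula.
    """
    counts = Counter(cluster_labels)
    raw = [1.0 / counts[label] for label in cluster_labels]
    total = sum(raw)
    return [w / total for w in raw]


def sample_areas(area_ids, cluster_labels, k, seed=0):
    """Draw k areas (with replacement) using the inverse-frequency weights."""
    rng = random.Random(seed)
    weights = inverse_frequency_weights(cluster_labels)
    return rng.choices(area_ids, weights=weights, k=k)


# Hypothetical toy input: 8 areas already assigned to 3 clusters
# (cluster 0 is common, cluster 2 is rare).
area_ids = ["a0", "a1", "a2", "a3", "a4", "a5", "a6", "a7"]
cluster_labels = [0, 0, 0, 0, 0, 1, 1, 2]

weights = inverse_frequency_weights(cluster_labels)
sampled = sample_areas(area_ids, cluster_labels, k=3000)

# Count how often each cluster appears among the sampled areas.
label_of = dict(zip(area_ids, cluster_labels))
sampled_cluster_counts = Counter(label_of[a] for a in sampled)
```

Under this assumed weighting, the single area in the rare cluster is drawn about as often as all five areas in the common cluster combined, so `sampled_cluster_counts` comes out roughly even across clusters. A softer scheme, for example weights proportional to 1 / sqrt(cluster size), would reduce the imbalance without fully equalizing the clusters.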