TensorLakeHouse: A High-Performance, Open-Source Platform for Accelerated Geospatial Data Management with Hierarchical Statistical Indices
Abstract
The exponential growth of geospatial data presents unprecedented opportunities and challenges for data-driven decision-making. To fully leverage this critical resource, we introduce TensorLakeHouse, a novel open-source platform for scientific data management and processing. Building on state-of-the-art standards and tools such as STAC, Xarray and openEO, TensorLakeHouse supports common geospatial data formats (COG, Zarr, netCDF) and storage solutions (S3 and other cloud object stores, cluster file systems). To accelerate query performance and enable intelligent data access, we introduce Hierarchical Statistical Indices (HSI), which enable read skipping and allow queries to be resolved directly from aggregate data. Additionally, our platform optimizes openEO processing graphs to support federated queries across distributed environments, thereby enabling complex analyses on large-scale datasets. TensorLakeHouse provides a holistic solution for the entire geospatial data lifecycle, encompassing data ingestion, discovery, high-performance streaming to GPU memory for AI workloads, and optimized data output. The platform's flexible, read stride-based data layout ensures wire-speed data delivery. In this presentation, we provide a technical overview of TensorLakeHouse and elaborate on the platform's architecture, with a focus on Hierarchical Statistical Indices and their role in optimizing query performance. We conclude with a production example from one of our projects, in which we use TensorLakeHouse to sample training data directly from a petabyte-scale data store into a PyTorch Lightning Trainer, showing orders-of-magnitude performance improvements over the state of the art. This work was co-funded by the European Union (Horizon Europe, Embed2Scale, 101131841) and also received funding from the Swiss State Secretariat for Education, Research and Innovation (SERI).
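To illustrate the two roles the abstract attributes to Hierarchical Statistical Indices (read skipping and resolving queries from aggregate data), the following is a minimal sketch and not the TensorLakeHouse implementation. It assumes a one-dimensional sequence of non-empty chunks, a tree of min/max/count summaries, and a single query type ("how many values exceed a threshold"). The names Stats, Node, build_index and count_above are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    lo: float   # minimum value covered by this node
    hi: float   # maximum value covered by this node
    count: int  # number of values covered by this node


class Node:
    """One index node: summary statistics plus either children or a chunk id."""

    def __init__(self, stats, children=None, chunk_id=None):
        self.stats = stats
        self.children = children or []
        self.chunk_id = chunk_id  # set only on leaf nodes


def build_index(chunks, fanout=2):
    """Build a statistics tree over a non-empty list of non-empty chunks."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    # Leaf level: one node per stored chunk.
    level = [Node(Stats(min(c), max(c), len(c)), chunk_id=i)
             for i, c in enumerate(chunks)]
    # Upper levels: merge `fanout` neighbours until a single root remains.
    while len(level) > 1:
        merged = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            stats = Stats(min(n.stats.lo for n in group),
                          max(n.stats.hi for n in group),
                          sum(n.stats.count for n in group))
            merged.append(Node(stats, children=group))
        level = merged
    return level[0]


def count_above(node, threshold, chunks, reads):
    """Count values > threshold, recording which chunks had to be read."""
    s = node.stats
    if s.hi <= threshold:
        return 0        # read skipping: nothing under this node can match
    if s.lo > threshold:
        return s.count  # resolved from the aggregate, no data read
    if node.chunk_id is not None:
        reads.append(node.chunk_id)  # statistics are inconclusive: read the chunk
        return sum(1 for v in chunks[node.chunk_id] if v > threshold)
    return sum(count_above(c, threshold, chunks, reads) for c in node.children)


# Hypothetical data: four chunks standing in for stored tiles.
chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
root = build_index(chunks)
reads = []
print(count_above(root, 5, chunks, reads), reads)  # prints: 7 [1]
```

In this toy run only chunk 1 is read. Chunk 0 is skipped because its maximum is below the threshold, and chunks 2 and 3 are answered together from their parent's count. A real geospatial index would presumably cover spatial and temporal dimensions and richer statistics, which this sketch does not attempt.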