CLOUD 2024
Conference paper

Intent-Driven Multi-Engine Observability Dataflows For Heterogeneous Geo-Distributed Clouds


With the growth of multi-cloud computing across a heterogeneous substrate of public cloud, edge, and on-premise sites, observability has been gaining importance in comprehending the state of availability and performance of large-scale geo-distributed software systems. Collecting, processing, and analyzing observability data from multiple geo-distributed clouds can be naturally modeled as dataflows comprising functions that are chained together. These observability flows pose a unique set of challenges, including (a) keeping cost budgets, resource overheads, network bandwidth consumption, and latency low, (b) scaling to a large number of clusters, (c) adapting the volume of observability data to ensure resource constraints and service level objectives are met, (d) supporting diverse engines per dataflow depending on each processing function of the flow, and (e) automating and optimizing placement of observability processing functions, including closed-loop orchestration. Towards this end, we propose Octopus, a multi-cloud, multi-engine observability processing framework. In Octopus, declarative observability dataflows (DODs) serve as an intent-driven abstraction for site reliability engineers (SREs) to specify self-driven observability dataflows. A dataflow engine in Octopus then orchestrates these DODs to automatically deploy and self-manage observability dataflows over large fabrics spanning multiple clouds and clusters. Octopus supports a mix of streaming and batch functions as well as pluggable runtime engines, thereby enabling flexible composition of multi-engine observability flows. Our early deployment experience with Octopus is promising. We have successfully deployed production-grade metrics analysis and log processing dataflows in an objective-optimized fashion across one cloud and ten edge clusters spanning continents.
Our results indicate a 2.3x reduction in data volume and 56% savings in WAN bandwidth, the ability to auto-scale DODs as input load varies, and the ability to flexibly relocate functions across clusters without violating latency targets.
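To make the intent-driven DOD abstraction concrete, the following is a minimal sketch of what a declarative observability dataflow specification could look like. The field names, engine names, placement hints, and validation helper below are illustrative assumptions for exposition only, not the actual Octopus schema.

```python
# Hypothetical sketch of a declarative observability dataflow (DOD):
# an SRE states objectives and a chain of functions; the orchestrator
# decides placement. All field names here are assumptions, not Octopus's API.
dod = {
    "name": "edge-metrics-rollup",
    "objectives": {"max_latency_ms": 500, "wan_budget_mbps": 10},
    "functions": [
        # A mix of streaming and batch functions, each potentially
        # bound to a different pluggable runtime engine.
        {"id": "collect",   "kind": "stream", "engine": "metrics-agent",
         "placement": "edge"},
        {"id": "aggregate", "kind": "stream", "engine": "stream-engine",
         "placement": "auto"},   # orchestrator may relocate this function
        {"id": "archive",   "kind": "batch",  "engine": "batch-engine",
         "placement": "cloud"},
    ],
    "edges": [("collect", "aggregate"), ("aggregate", "archive")],
}

def validate(spec):
    """Check that every edge references a declared function id."""
    ids = {f["id"] for f in spec["functions"]}
    return all(src in ids and dst in ids for src, dst in spec["edges"])

print(validate(dod))  # True
```

Under this sketch, a dataflow engine would read such a spec, choose clusters for any function marked `"auto"`, and continuously re-plan placement as load and objectives change.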