About cookies on this site Our websites require some cookies to function properly (required). In addition, other cookies may be used with your consent to analyze site usage, improve the user experience and for advertising. For more information, please review your options. By visiting our website, you agree to our processing of information as described in IBM’sprivacy statement. To provide a smooth navigation, your cookie preferences will be shared across the IBM web domains listed here.
Publication
IC2E 2024
Conference paper
Unleashing Performance Insights with Online Probabilistic Tracing
Abstract
Distributed tracing has become a fundamental tool for diagnosing performance issues in the cloud by recording causally ordered, end-to-end workflows of request executions. However, tracing workloads in production can introduce significant overheads due to the extensive instrumentation needed for identifying performance variations. This paper addresses the trade-off between the cost of tracing and the utility of the 'spans' within that trace through Astraea, an online probabilistic distributed tracing system. Astraea is based on our technique that combines online Bayesian learning and multi-armed bandit frameworks. This formulation enables Astraea to effectively steer tracing towards the useful instrumentation needed for accurate performance diagnosis. Astraea localizes performance variations using only 20-35% of available instrumentation, markedly reducing tracing overhead, storage, compute costs, and trace analysis time.