CLOUD 2024
Conference paper

Self Adjusting Log Observability for Cloud Native Applications


With the increasing complexity of modern applications, particularly those built on microservices architectures, the volume of observability data (logs, metrics, traces, etc.) has surged. This is further exacerbated by the extensive deployment of applications on the cloud, where observability is crucial for tracking health and performance and for post-hoc diagnosis, which leads to collecting as much data as possible for "fear of missing out". However, the collection, storage, and analysis of this data come at a considerable cost, both in resources and in money. Logs constitute the largest portion of the observability data volume, so they have the greatest effect on observability cost. Moreover, logs are unstructured and noisy, and the efficacy of downstream AIOps tasks (Day-2 operations), such as log anomaly detection, root cause analysis, and fault category prediction, can be negatively impacted by log data volume. Hence, striking a balance between the benefits of log observability and its impact on Day-2 operations and debuggability is essential. In this paper, we propose an autonomous system that selectively collects logs when they are needed, from where they are needed, and at the required granularity, as opposed to collecting logs from everywhere all the time. Our experiments show that such a system decreases log volume by as much as 90% while still maintaining data quality for downstream AIOps usage, especially for post-hoc diagnosis tasks. Operating on a reduced volume of log data not only decreases storage, transfer, and retention costs but also streamlines observability pipelines, making them leaner, more efficient, and less resource-hungry.