CogNETive - Insights and visualization for operations@scale

Dean H. Lorenz; Eran Raichstein; Kathy Barabash; Hillel Kolodner; Liran Schour; Shelly Garion

doi:10.1145/3078468.3078495

SYSTOR 2017

Conference paper

22 May 2017

Abstract

Operating a cloud-scale service is a huge challenge. Such a service has millions of users worldwide and handles millions of requests per second. For example, Amazon's Simple Storage Service (S3) contained two trillion objects in 2013, and its logs grew by 1.1 million log lines per second, amounting to approximately 10 PB of log records per year (see [1]). Cloud scale implies thousands of servers and network elements, and hundreds of services spread across multiple cross-regional data centers. Cloud service operation data is scattered over various types of semi-structured and unstructured logs (e.g., application, error, debug), telemetry and network data, as well as customer service records. It is therefore extremely difficult for the multiple owners and administrators of such systems, who come from different units of the organization, to follow the possible paths and system alternatives in order to detect problems, solve issues, and understand the service operation. Many tools exist for the collection, analysis, and visualization of operational data, both proprietary, such as Splunk and vRealize, and open-source, such as Elasticsearch-Logstash-Kibana (ELK) and Grafana. However, current Big Data analytics and machine learning techniques are still in their infancy when it comes to the specific domain of IT operations. Current methods produce many dashboards, reports, and alerts that are opaque, hard to understand, and require interdisciplinary DevOps skills. Important problems may take too long to solve or may be overlooked, and trends and imminent problems are not detected before service is affected. To monitor large-scale distributed systems, major cloud companies develop their own tools to collect and analyze logs, traces, and telemetry data (such as Google's Dapper [2], Facebook's Mystery Machine [3], and Netflix's Atlas [4]). This shows that log collection and operational analytics play an essential role in production cloud-scale environments.
