Big data analysis of cloud storage logs using Spark
We use Apache Spark analytics to investigate the logs of an operational cloud object store service to understand how it is being used. This investigation requires going retroactively over very large amounts of historical data (PBs of records in some cases) collected over long periods of time. Existing tools, such as Elasticsearch-Logstash-Kibana (ELK), are mainly used for presenting short-term metrics and cannot perform advanced analytics such as machine learning. A possible alternative is to retain for long periods only certain aggregations or calculations produced from the raw log data, such as averages or histograms; however, these must be decided in advance and cannot be changed retroactively once the raw data has been discarded. Spark allows us to gain insights from historical data collected over long periods of time and to apply the resulting models to online data in a simple and efficient way. Stoica and Ha briefly describe the API for parsing, loading, and basic analytics of logs with Spark, and Lin et al. show that Spark is more efficient than Hadoop for certain log analysis use cases.

However, it is not enough to just use Spark. Given the volume of operations on a cloud object store and the number of objects, even with Spark we need to be smart about the way we do the analysis. Our techniques include sampling, smart grouping and aggregation, and the use of machine learning methods targeted at log data. We present three use cases where we apply Spark to analyze cloud object store logs.

The first use case is to identify time frames in which performance decreased, causing a subsequent increase in operation latencies. We decided to focus on HEAD operations, as their latency is independent of object size. Next, we wanted to calculate the percentiles of the latencies for each time period (e.g., median, 90%, 99%).
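As a minimal sketch of the parsing and filtering step, the following plain Python functions show the kind of per-line logic that would be passed to Spark's map and filter operations; the log format and field names are illustrative assumptions, not the actual service's format:

```python
# Hypothetical log format: "timestamp verb path status latency_ms".
# In Spark, these functions would be applied via
# sc.textFile(path).map(parse_line).filter(is_head).
def parse_line(line):
    ts, verb, path, status, latency_ms = line.split()
    return {"ts": ts, "verb": verb, "path": path,
            "status": int(status), "latency_ms": float(latency_ms)}

def is_head(rec):
    # Keep only HEAD operations: their latency is independent of object size.
    return rec["verb"] == "HEAD"

sample = [
    "2016-01-01T00:00:00 HEAD /bucket/obj1 200 3.2",
    "2016-01-01T00:00:01 GET /bucket/obj2 200 41.7",
]
heads = [r for r in map(parse_line, sample) if is_head(r)]
# heads contains only the HEAD record, with latency_ms == 3.2
```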
Since there could be millions of operations per second, it would be impractical to collect all the latencies, sort them, and calculate the exact percentiles. Instead, we estimate the percentiles by dividing the latencies into a histogram with a fixed number of cells, where each cell represents a range of latencies. We then use the "Map/Reduce" method to map each log line to its appropriate cell and to sum the number of lines in each cell. Figure 1 shows the results of our analysis: toward the end of the period the latencies increase in an anomalous way.

In the second use case, we show how to estimate the potential for archiving in this large service, e.g., the number of candidate objects, the expected archive size, and the criteria for archiving. Compiling information from the logs for all objects that have ever been created, used, rewritten, or erased is challenging, since there are billions or trillions of objects.

The last use case is to detect security threats and anomalies. With Spark we can train a model of "normal" customer behavior based on long time spans and then detect customers that exhibit abnormal behavior. We adapted Melody, an anomaly detection technology for logs, to the specific use case of object store logs.
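The histogram-based percentile estimation from the first use case can be sketched as follows. Plain Python stands in for the Spark pipeline (the map-to-cell step corresponds to `rdd.map(lambda lat: (cell(lat), 1))` and the summation to `reduceByKey`); the cell width and the sample latencies are illustrative assumptions:

```python
from collections import Counter

# Fixed cell width in milliseconds (an assumption for this sketch).
CELL_WIDTH_MS = 1.0

def cell(latency_ms):
    # Map a latency to its histogram cell; cell c covers [c, c+1) ms.
    return int(latency_ms // CELL_WIDTH_MS)

def estimate_percentile(hist, q):
    # Walk the cells in order, accumulating counts, and return the upper
    # bound of the cell containing the q-th percentile.
    total = sum(hist.values())
    target = q * total
    seen = 0
    for c in sorted(hist):
        seen += hist[c]
        if seen >= target:
            return (c + 1) * CELL_WIDTH_MS

latencies = [0.4, 0.7, 1.2, 1.9, 2.5, 3.8, 4.1, 9.9, 10.2, 50.0]
hist = Counter(cell(lat) for lat in latencies)  # the "reduce" step
median = estimate_percentile(hist, 0.5)
# -> 3.0: the 5th smallest latency, 2.5, falls in cell [2, 3)
```

The estimate's error is bounded by the cell width, so the number of cells trades accuracy against the (fixed, small) per-partition state that must be shuffled and merged.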