Workload characterization for operator-based distributed stream processing applications
Abstract
Operator-based programming languages provide an effective development model for large scale stream processing applications. A stream processing application consists of many runtime deployable software processing elements (PE) that work in flows to process incoming messages. Operators (OP) are logical building blocks hosted by PEs. One or more OPs can be fused into a PE at compile-time. Performance optimization for our streaming system includes compile-time fusion optimization and runtime PE-to-host deployment. One of the goals of an optimized stream application is to use minimal computing resource to sustain maximal message throughput. Characterizing the resource usage of PEs is critical for performance optimization. During compile-time optimization, OP-level resource usage is used to predict the resource usage of fused PEs. When starting an application, PE-level resource usage is used as an initial estimation by the scheduler. In this paper, we propose an efficient workload characterization approach for data stream processing systems. Our method includes the procedures for obtaining reusable OP-level resource usage information from profiling data and recomposing OP-level profiles to predict PE-level resource usage. We present several techniques to overcome measurement errors from the OP data collection. The impact of hardware speed and multi-threading contention on hyper-threading and multi-core machines are also studied. We show that our method can be applied to several streaming applications and the prediction of the PE CPU resource usage is within 15% of the actual CPU usage. © 2010 ACM.