Publication
Future Generation Computer Systems
Paper

Adding domain data to code profiling tools to debug workflow parallel execution

Abstract

Computer simulations may be composed of several scientific programs chained in a coherent flow and run in High Performance Computing and cloud environments. These runs may exhibit different execution behavior depending on the parallel flow of data among programs. Gaining insight into this parallel flow of data is important for several applications. The usual way of getting insight into code performance is by means of a code profiler. Several parallel code-profiling tools already support performance analysis, such as Tuning and Analysis Utilities (TAU), or provide fine-grained performance statistics, e.g., System Activity Report (SAR). These tools are effective for code profiling, but they are not connected to the concept of IO-intensive workflows. Analyzing workflow execution with both domain and performance data is important for users because it lets them identify anomalies, choose suitable machines to run their workflows, etc. This type of analysis may be performed by capturing execution data enriched with fine-grained domain data during the long-term run of a computer simulation. In this paper, we propose a monitoring data capture approach, implemented as a component that couples code-profiling tools to domain data from workflow executions. The goal is to profile and debug parallel executions of workflows through queries to a database that integrates, at runtime, performance, resource consumption, provenance, and domain data from the flow of simulation programs. We show how querying this database with domain-aware data at runtime makes it possible to identify performance anomalies that code-profiling tools do not detect. We evaluate our approach using the astronomy Montage workflow on a cluster environment and the SciPhy bioinformatics workflow on the Amazon cloud. In both cases, the computing time overhead imposed by our approach for gathering fine-grained domain, performance, and resource consumption data is negligible.