Integrating domain-data steering with code-profiling tools to debug data-intensive workflows
Abstract
Computer simulations may be composed of scientific programs chained in a coherent flow and executed in High Performance Computing environments. These executions may present anomalies associated with the data that flows in parallel among programs. Several parallel code-profiling tools already support performance analysis, such as Tuning and Analysis Utilities (TAU), or provide fine-grained performance statistics, such as the System Activity Report (SAR). However, these tools do not associate their results with the corresponding dataflows, and this association is fundamental to tracing an error back to its data origins. In this paper, we propose coupling a workflow data monitoring approach with parallel code-profiling tools for workflow executions. The goal is to profile and debug parallel workflow executions by querying a database that integrates performance, resource consumption, provenance, and domain data from simulation programs at runtime. We have implemented our data monitoring approach as a software component coupled to the TAU and SAR code-profiling tools. Using the astronomy Montage workflow as a motivating example, we show how querying the resulting integrated database enables domain-aware runtime steering of performance anomalies. We observe that the overhead introduced by our approach is negligible.