Scalable, fault-tolerant job step management for high-performance systems
- D. Solt
- J. Hursey
- et al.
- 2020
- IBM J. Res. Dev.
I am a senior software engineer at IBM. I contribute to system software development for on-premises and cloud-based High Performance Computing (HPC) and Artificial Intelligence (AI) systems of all scales and lead a team of developers. My team is dedicated to building the end-to-end software stack that supports executing an AI model using multiple IBM Spyre AIU accelerators for distributed inference and other related tasks.
My work at IBM has involved deploying the ORNL Summit and LLNL Sierra 100+ PFlop pre-exascale HPC systems, where I focused on the Spectrum MPI and Job Step Manager (JSM) components of the HPC software stack. These large-scale systems have honed my skills and broadened my professional interests, which span AI and HPC libraries, cloud ecosystems, containerization, scheduler and runtime systems, scientific computing, and AI/DL/ML workflows.
I received my B.A. degree in Computer Science from Earlham College, where I was inspired to pursue high-performance computing and computing for the common good. I received my M.S. and Ph.D. in Computer Science from Indiana University, where I worked primarily on fault tolerance in the Open MPI project. I held internships at multiple national laboratories (LANL, LBNL, ORNL) during graduate school, spent two years as a post-doctoral researcher at ORNL, and spent four years as an Assistant Professor at the University of Wisconsin-La Crosse before joining IBM to work on the CORAL pre-exascale supercomputers currently running at ORNL and LLNL. Those CORAL systems support a broad portfolio of scientific endeavors including, most recently, the fight against COVID-19.