An Architecture for Heterogeneous High-Performance Computing Systems: Motivation and Requirements

Christoph Hagleitner; Charles Johns; Christian Pinto; Constantinos Evangelinos; Florian Auernhammer; Guerney Hunt; James Sexton; Jim Kahle; Michael Johnston

IEEE JVA Symposium 2023

Conference paper

05 Jul 2023

Abstract

Today’s rapid progress in AI and science is largely fueled by the availability of ever larger and more powerful compute systems. The “classic” HPC systems targeted at executing complex workflows and simulations have recently crossed the exaflop boundary in terms of their double-precision floating-point performance. At the same time, new systems targeted at training large AI models use alternative number representations and are already pushing the limits well beyond the exaflop barrier. In order to continue scaling the performance of large HPC systems, system architects need to address several barriers, including the slowdown of Moore’s law, energy density limitations, production yield challenges, and practical limits to overall power consumption in the tens-of-MW range. All recent #1 HPC systems already rely on specialized, heterogeneous components to offset the slowdown. As specialization continues and advances, the heterogeneity will evolve from today’s CPU-GPU combinations into a broad set of more specialized accelerators, and entirely new computing paradigms, e.g., quantum computing, are emerging as well. As the total system size and the compute density within a single node are scaled further, the intra- and inter-node communication requirements increase accordingly. Today, all available interconnect fabrics that support symmetric multiprocessing (SMP) and/or asymmetric variants of cache-coherent communication are based on proprietary implementations, which prevents the assembly of innovative heterogeneous high-performance systems from components supplied by more than a single vendor. Hence, system architects are looking at ways to contain complexity and simplify system deployments. Under continued cost constraints, better utilization is desired, achieved by matching the hardware configuration to software usage needs. The ability to compose virtual compute nodes from a set of disaggregated components is a natural way of approaching the problem.
The first challenge of composability that is currently being tackled is memory disaggregation. A vision of higher utilization and resource sharing is appealing, but low latency and high bandwidth need to be maintained. All of these trends and limitations demand a fresh look at the architectures needed to enable an ecosystem from which domain-specific high-performance computing systems can be assembled.

In this presentation, we discuss the motivation and requirements for a new node-level and rack-scale architecture as well as the need for a standards-based and open high-performance interconnect fabric. The architecture of this system needs to be accompanied by an open and interoperable software stack as well as a fine-grained control plane. The control plane enables and supports composability under tight security and performance constraints. Composability originated as a way to increase the efficiency of heterogeneous computer systems. More recently, composability has also been proposed as a means for heterogeneous components to share a common memory pool, reduce data traffic, and increase the speed of cooperation among the heterogeneous elements. Furthermore, the use of heterogeneity must expand from the current rack-level or board-level integration down to chiplet-based modules, and even System-on-Chip (SoC), depending on the scale and demands of a workflow. A standards-based interconnect fabric is a key element that will allow innovations from different heterogeneous components to be mixed beyond the limitations of any single vendor and is also a key ingredient for an industry growth play. For board-level connections outside of the symmetric multiprocessing (SMP) fabric, the evolving CXL standard seems to be a good match for this role, as it supports traditional I/O connectivity plus a scalable memory extension. CXL over UCIe offers the possibility to extend this value proposition from board level to a chiplet-based ecosystem. For extended reach, CXL over an optics standard would provide for even larger-scale composable systems. The composable elements in a compute fabric need a distributed control structure for initialization, resource management, and workflow control. Open standards such as OFMF will play a critical role in the overall system management. Open standards will also play a critical role in the security of a system assembled from heterogeneous elements.
Security features will be required to address supply chain attacks, to provide for authentication and attestation of each component and for secure and confidential communication between the various components in the heterogeneous system, and to provide a Trusted Execution Environment for Confidential Computing. Isolation and then attestation of these elements would need to be consistent across the system.
