Data Movement Accelerator Engines on a Prototype Power10 Processor
Abstract
This paper presents the design and implementation of Active Messaging Engines (AMEs) on an IBM Power10 prototype chip. AMEs are small, simple, yet fully programmable 64-bit processors that offload operations related to data movement. AMEs can offload the execution flow of MPI and other messaging stacks from the host CPU, enabling truly asynchronous progress so that computation and communication overlap. The AMEs are implemented as on-board OpenCAPI-compliant accelerators, leveraging existing OpenCAPI infrastructure. As realized in a 7 nm technology, each AME occupies 0.034 mm² of silicon area and consumes 4.1 mW of power. AME performance is evaluated across several contiguous and non-contiguous memory copy scenarios. AMEs achieve throughput up to the bandwidth limit of their access path to main memory (32 GB/s) and incur a per-request overhead of about 600 ns. These results indicate that AMEs can benefit general messaging libraries in processing, sending, and receiving both on-node and off-node messages.
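To put the two headline figures in context, the following minimal sketch estimates the effective copy bandwidth as a function of request size. It assumes a simple first-order cost model, time = overhead + size / peak bandwidth, which is an illustrative assumption rather than a model taken from the measurements. The function names are hypothetical; only the 32 GB/s and 600 ns constants come from the text above.

```python
# Assumed first-order cost model (illustrative, not from the paper):
#   time(n) = per_request_overhead + n / peak_bandwidth

PEAK_BW_BYTES_PER_S = 32e9  # 32 GB/s access-path limit quoted in the abstract
OVERHEAD_S = 600e-9         # ~600 ns per-request overhead quoted in the abstract


def copy_time_s(nbytes, overhead_s=OVERHEAD_S, peak_bw=PEAK_BW_BYTES_PER_S):
    """Estimated time in seconds to service one copy request of nbytes."""
    return overhead_s + nbytes / peak_bw


def effective_bandwidth(nbytes, overhead_s=OVERHEAD_S, peak_bw=PEAK_BW_BYTES_PER_S):
    """Estimated bytes per second delivered for one request of nbytes."""
    return nbytes / copy_time_s(nbytes, overhead_s, peak_bw)


def half_peak_size(overhead_s=OVERHEAD_S, peak_bw=PEAK_BW_BYTES_PER_S):
    """Request size at which the assumed model reaches half of peak bandwidth.

    Solving n / (overhead + n / peak) = peak / 2 gives n = overhead * peak.
    """
    return overhead_s * peak_bw


if __name__ == "__main__":
    for size in (256, 4096, 65536, 1 << 20):
        gbps = effective_bandwidth(size) / 1e9
        print(f"{size:>8} B: {gbps:6.2f} GB/s")
    print(f"half-peak request size: {half_peak_size():.0f} B")
```

Under this assumed model, the fixed overhead dominates for requests well below roughly 19 KB (600 ns × 32 GB/s), while larger requests approach the access-path limit.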