Abstract
To increase the performance of data-intensive applications, we present an extension to a CPU architecture that enables arbitrary near-data processing capabilities close to the main memory. This is realized by introducing a component attached to the CPU system-bus and a component at the memory side. Together they support hardware-managed coherence and virtual memory support to integrate the near-data processors in a shared-memory environment. We present an implementation of the components, as well as a systemsimulator, providing detailed performance estimations. With a variety of syntheticworkloadswe demonstrate the performance of the memory accesses, the mixed fine-And coarse-grained coherence mechanisms, and the near-data processor communication mechanism. Furthermore, we quantify the inevitable start-up penalty regarding coherence and data writeback, and argue that near-data processingworkloads should access data several times to offset this penalty. A case study based on the Graph500 benchmark confirms the small overhead for the proposed coherence mechanisms and shows the ability to outperform a real CPU by a factor of two.