Excavating the hidden parallelism inside DRAM architectures with buffered compares
Abstract
We propose buffered compares, a less invasive processing-in-memory approach that works with existing processor-memory interfaces such as DDR3/4 with minimal changes. The approach rests on the observation that the multibank architecture, a key feature of modern main-memory DRAM devices, can provide very high internal bandwidth without any major modification. We place a small buffer and a simple ALU in each bank, define a set of new DRAM commands that fill the buffer and feed data to the ALU, and return one result per set of commands (rather than one per command) to the host memory controller. By exploiting the under-utilized internal bandwidth with 'compare-n-op' operations, which appear frequently in many applications, we not only reduce the amount of energy-inefficient processor-memory communication but also accelerate big data processing applications by using the parallelism of the buffered compare units across DRAM banks. We present two versions of the buffered compare architecture, a full-scale architecture and a reduced architecture, which trade off performance against energy. Experimental results show that our solution significantly improves system performance and efficiency on the tested workloads.
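The idea can be illustrated with a minimal software sketch. It is not the hardware design or the command set proposed here: the names `Bank`, `fill_buffer`, `compare_count`, `buffered_search`, and `host_search` are hypothetical, and the "op" after the compare is assumed to be a simple match count. The sketch only contrasts how much data crosses the memory interface when the compare runs next to each bank versus on the host.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Bank:
    """One DRAM bank with a hypothetical key buffer and compare unit."""
    rows: List[int]      # data words stored in this bank
    key_buffer: int = 0  # small per-bank buffer (assumed to hold one key)

    def fill_buffer(self, key: int) -> None:
        # Models a command that loads the search key into the bank's buffer.
        self.key_buffer = key

    def compare_count(self) -> int:
        # Compare-n-op: compare every word with the buffered key, then count.
        return sum(1 for word in self.rows if word == self.key_buffer)


def buffered_search(banks: List[Bank], key: int) -> Tuple[int, int]:
    """Compare inside each bank; only one result per bank reaches the host."""
    for bank in banks:
        bank.fill_buffer(key)
    per_bank = [bank.compare_count() for bank in banks]  # independent per bank
    return sum(per_bank), len(per_bank)  # (matches, results sent to host)


def host_search(banks: List[Bank], key: int) -> Tuple[int, int]:
    """Baseline: every word is transferred to the host and compared there."""
    matches = 0
    transferred = 0
    for bank in banks:
        for word in bank.rows:
            transferred += 1
            if word == key:
                matches += 1
    return matches, transferred  # (matches, words sent to host)


if __name__ == "__main__":
    banks = [Bank([1, 2, 3, 2]), Bank([2, 2, 5, 6])]
    print(buffered_search(banks, 2))
    print(host_search(banks, 2))
```

Both functions find the same matches, but in this toy model the buffered version returns one value per bank while the baseline moves every stored word to the host. The sketch models neither timing nor energy.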