Near-data processing is a promising paradigm to address the bandwidth, latency, and energy limitations in today's computer systems. In this work, we introduce an architecture that enhances a contemporary multi-core CPU with new features for supporting a seamless integration of near-data processing capabilities. Crucial aspects such as coherency, data placement, communication, address translation, and the programming model are discussed. The essential components, as well as a system simulator, are realized in hardware and software. Results for the important Graph500 benchmark show a 1.5x speedup when using the proposed architecture.