The Active Memory Cube (AMC) system is a novel heterogeneous computing system concept designed to provide high performance and power-efficiency across a range of applications. The AMC architecture includes general-purpose host processors and specially designed in-memory processors (processing lanes) that would be integrated in a logic layer within 3D DRAM memory. The processing lanes have large vector register files but no power-hungry caches or local memory buffers. Performance depends on how well the resulting higher effective memory latency within the AMC can be managed. In this paper, we describe a combination of programming language features, compiler techniques, operating system interfaces, and hardware design that can effectively hide memory latency for the processing lanes in an AMC system. We present experimental data to show how this approach improves the performance of a set of representative benchmarks important in high performance computing applications. As a result, we are able to achieve high performance together with power efficiency using the AMC architecture.