By performing parallelized multiply-accumulate operations in the analog domain at the location of weight data, crossbar-array “tiles” of analog non-volatile memory (NVM) devices can potentially accelerate the forward-inference of deep neural networks. To be successful, such systems will need to achieve two related but challenging goals. The first is high neural network classification accuracy, indistinguishable from that achieved with conventional approaches, despite the difficulty of programming NVM devices accurately in the presence of significant device-to-device variability. Towards this first goal, we describe row-wise Phase-Change Memory (PCM) programming schemes for rapid yet accurate weight-programming. The second goal is highly energy-efficient forward-inference of multi-layer neural networks, which requires efficiency both in the massively-parallel analog-AI operations performed at each tile and in how the resulting neuron-excitation data vectors are conveyed from tile to tile. Towards this second goal, we describe micro-architectural design ideas including source-follower-based readout, array segmentation, and transmit-by-duration.
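As a rough illustration of the multiply-accumulate operation a single tile performs, the following minimal sketch models a crossbar in software. It is not the programming or readout scheme described in this work. The function names `program_tile` and `tile_mac`, the encoding of each signed weight as a differential conductance pair, the value of `g_max`, and the additive Gaussian model of programming error are all assumptions made for illustration.

```python
import random

def program_tile(weights, g_max=25e-6, sigma=0.0, seed=0):
    """Map signed weights onto conductance pairs (G+, G-).

    Assumptions: g_max (siemens) is a hypothetical maximum device
    conductance, and programming error is modeled as additive Gaussian
    noise with standard deviation sigma * g_max, clipped to [0, g_max].
    The weight matrix must contain at least one nonzero entry.
    """
    rng = random.Random(seed)
    w_max = max(abs(w) for row in weights for w in row)
    scale = g_max / w_max  # siemens per unit of weight

    def noisy(g):
        # Hypothetical device-to-device variability model.
        return min(max(g + rng.gauss(0.0, sigma * g_max), 0.0), g_max)

    g_plus = [[noisy(max(w, 0.0) * scale) for w in row] for row in weights]
    g_minus = [[noisy(max(-w, 0.0) * scale) for w in row] for row in weights]
    return g_plus, g_minus, scale

def tile_mac(x, g_plus, g_minus, scale):
    """Sum the per-device products x[i] * G[i][j] along each column.

    Each column current is the difference between the G+ and G-
    contributions, rescaled back into weight units.
    """
    n_cols = len(g_plus[0])
    out = []
    for j in range(n_cols):
        current = sum(
            x[i] * (g_plus[i][j] - g_minus[i][j]) for i in range(len(x))
        )
        out.append(current / scale)
    return out
```

With `sigma=0.0` the result matches the exact vector-matrix product to within floating-point error. Raising `sigma` gives a crude view of how programming error perturbs the column sums, which is the accuracy concern behind the first goal.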