This paper presents an energy-efficient, high-throughput architecture for convolutional neural networks (CNNs). Architectural and circuit techniques are proposed to address the dominant energy and delay costs associated with data movement in CNNs. The design employs a deep in-memory architecture to embed energy-efficient, low-swing mixed-signal computations in the periphery of the SRAM bitcell array. An efficient data access pattern and a mixed-signal multiplier are proposed to exploit data reuse opportunities in convolution. Silicon-validated energy, delay, and behavioral models of the proposed architecture are developed and employed to perform large-scale system simulations. System-level simulations using these models show >97% detection accuracy on the MNIST data set, along with $4.9\times$ and $2.4\times$ improvements in energy efficiency and throughput, respectively, leading to an $11.9\times$ reduction in energy-delay product compared with a conventional (SRAM + digital processor) architecture.
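
The relationship between the three reported gains can be checked with a short calculation. The sketch below is illustrative only and is not taken from the paper: the function name `edp_improvement` is hypothetical, and it assumes that the energy-delay product is energy per decision multiplied by delay per decision, with delay scaling inversely with throughput. Under those assumptions the rounded factors multiply to about 11.8; the small gap to the reported $11.9\times$ is presumably due to rounding of the individual factors, which is also an assumption.

```python
def edp_improvement(energy_gain: float, throughput_gain: float) -> float:
    """Estimate the reduction factor in energy-delay product (EDP).

    Assumes EDP = energy * delay, and that delay per decision scales as
    1 / throughput, so the two gains simply multiply. This is a
    back-of-the-envelope model, not the paper's simulation flow.
    """
    if energy_gain <= 0 or throughput_gain <= 0:
        raise ValueError("gains must be positive")
    return energy_gain * throughput_gain


# Rounded figures quoted in the abstract (energy efficiency, throughput).
estimate = edp_improvement(4.9, 2.4)
print(f"Estimated EDP reduction: {estimate:.2f}x")
```

The multiplicative model is the simplest one consistent with the abstract; a more detailed estimate would need the unrounded energy and delay values from the system simulations.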