We introduce a highly heterogeneous and programmable compute-in-memory (CIM) accelerator architecture for deep neural network (DNN) inference. This architecture combines spatially distributed CIM memory array 'tiles' for weight-stationary, energy-efficient multiply-accumulate (MAC) operations with heterogeneous special-function compute cores for auxiliary digital computation. Massively parallel vectors of neuron-activation data are exchanged over short distances using a dense and efficient circuit-switched 2-D mesh, offering full end-to-end support for a wide range of DNN workloads, including convolutional neural networks (CNNs), long short-term memory (LSTM), and transformers. We discuss the design of the 'analog fabric' (the 2-D grid of tiles and compute cores interconnected by the 2-D mesh) and address efficiency both in mapping DNNs onto the hardware and in pipelining various DNN workloads across a range of batch sizes. We present, for the first time, system-level assessments using projected component parameters for a realistic 'analog AI' system based on dense crossbar arrays of low-power nonvolatile analog memory elements, while incorporating a single common analog fabric design that can scale to large networks by introducing data transport between multiple analog AI chips. Our performance estimates for several networks, including large LSTM and bidirectional encoder representations from transformers (BERT), show highly competitive throughput while offering 40× to 140× higher energy efficiency than an NVIDIA A100 GPU, illustrating the strong promise of analog AI and the proposed architecture for DNN inference applications.
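To make the weight-stationary CIM idea concrete, the sketch below models a tile's analog MAC as a matrix-vector product: weights remain resident as conductances, an activation vector drives the rows, and per-column current summation yields the dot products. A larger layer is split across a grid of tiles whose partial sums are combined, loosely mimicking the 2-D fabric. This is an illustrative functional model under assumed names (`analog_tile_mac`, `fabric_matvec`) and a simple Gaussian stand-in for conductance noise, not the paper's hardware model.

```python
import numpy as np

rng = np.random.default_rng(0)

def analog_tile_mac(weights, activations, noise_std=0.01):
    """One weight-stationary CIM tile: weights stay resident as analog
    conductances; applying an activation vector on the rows produces
    per-column accumulated currents, i.e. a matrix-vector product.
    Gaussian perturbation is a crude stand-in for programming/read noise."""
    g = weights + rng.normal(0.0, noise_std, size=weights.shape)  # noisy conductances
    return activations @ g  # column-wise current summation = MAC

def fabric_matvec(weights, activations, tile=4, noise_std=0.01):
    """A layer too large for one tile is partitioned across a grid of
    tiles (hypothetical mapping); partial sums from tiles sharing output
    columns are accumulated, as the mesh would route and combine them."""
    rows, cols = weights.shape
    out = np.zeros(cols)
    for r in range(0, rows, tile):
        for c in range(0, cols, tile):
            out[c:c + tile] += analog_tile_mac(
                weights[r:r + tile, c:c + tile],
                activations[r:r + tile],
                noise_std=noise_std,
            )
    return out
```

With `noise_std=0` the tiled result matches the exact matrix-vector product, while a nonzero `noise_std` shows how analog nonidealities perturb the MAC outputs.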