A processor core is presented for AI training and inference products. Leading-edge compute efficiency is achieved for robust fp16 training via efficient heterogeneous 2-D systolic array-SIMD compute engines leveraging compact DLFloat16 FPUs. Architectural flexibility is maintained for very high compute utilization across neural network topologies. A modular dual-corelet architecture with a shared scratchpad and a software-controlled network/memory interface enables scalability to many-core SoCs and large-scale systems. The 14nm AI core achieves fp16 peak performance of 3.0 TFLOPS at 0.62V and 1.4 TFLOPS/W at 0.54V.