We describe a hybrid GPU/CPU architecture for stochastic gradient descent (SGD) training of neural network acoustic models under a lattice-based minimum Bayes risk (MBR) criterion. The crux of the method is to run SGD on a GPU card that consumes frame-randomized mini-batches. These mini-batches are produced by multiple workers running on a cluster of multi-core CPU nodes, which compute the HMM state MBR occupancies. To minimize communication cost, a separate thread running on the GPU host receives mini-batches from the workers and sends updated models back to them, and communicates with the SGD thread via a producer-consumer queue of mini-batches. Using this architecture, it is possible to match the speed of GPU-based SGD cross-entropy (CE) training (1 hour of processing per 100 hours of audio on Switchboard). Additionally, we compare different ways of doing frame randomization and discuss experimental results on three LVCSR tasks (Switchboard 300 hours, English broadcast news 50 hours, and noisy Levantine telephone conversations 300 hours).
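The producer-consumer hand-off between the communication thread and the SGD thread can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the end-of-stream marker, the queue size, and the toy scalar model trained with a squared-error gradient are all assumptions made for illustration. The CPU workers are replaced by an in-memory list of mini-batches, and sending updated models back to the workers is not modeled.

```python
import queue
import threading

_DONE = object()  # hypothetical end-of-stream marker, not from the paper


def receive_from_workers(worker_batches, batch_queue):
    """Stand-in for the host thread that receives mini-batches from CPU workers."""
    for batch in worker_batches:
        batch_queue.put(batch)  # blocks while the bounded queue is full
    batch_queue.put(_DONE)


def sgd_loop(batch_queue, state, learning_rate):
    """Stand-in for the SGD thread: fits a scalar w so that w * x is close to y."""
    while True:
        batch = batch_queue.get()  # blocks while the queue is empty
        if batch is _DONE:
            break
        # Gradient of the mean of 0.5 * (w * x - y)**2 with respect to w.
        # (A toy loss; the paper's gradients come from MBR occupancies.)
        grad = sum((state["w"] * x - y) * x for x, y in batch) / len(batch)
        state["w"] -= learning_rate * grad
        state["updates"] += 1


def run_pipeline(worker_batches, queue_size=2, learning_rate=0.1):
    """Run one producer and one consumer thread over the given mini-batches."""
    batch_queue = queue.Queue(maxsize=queue_size)
    state = {"w": 0.0, "updates": 0}
    producer = threading.Thread(
        target=receive_from_workers, args=(worker_batches, batch_queue)
    )
    consumer = threading.Thread(
        target=sgd_loop, args=(batch_queue, state, learning_rate)
    )
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return state
```

Calling `run_pipeline([[(1.0, 2.0), (2.0, 4.0)]] * 50)` performs one update per mini-batch and drives `w` toward 2.0. The bounded queue is the design point of interest: it lets the consumer keep working while the producer waits on input, and it stops the producer from running arbitrarily far ahead of the consumer.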