Deep Neural Networks (DNNs) have achieved impressive accuracy in many application domains including im-age classification. Training of DNNs is an extremely compute-intensive process and is solved using variants of the stochastic gradient descent (SGD) algorithm. A lot of recent research has focused on improving the performance of DNN training. In this paper, we present optimization techniques to improve the performance of the data parallel synchronous SGD algorithm using the Torch framework: (i) we maintain data in-memory to avoid file I/O overheads, (ii) we propose optimizations to the Torch data parallel table framework that handles multi-threading, and (iii) we present MPI optimization to minimize communication overheads. We evaluate the performance of our optimizations on a Power 8 Minsky cluster with 64 nodes and 256 NVidia Pascal P100 GPUs. With our optimizations, we are able to train 90 epochs of the ResNet-50 model on the Imagenet-1k dataset using 256 GPUs in just 48 minutes. This significantly improves on the previously best known performance of training 90 epochs of the ResNet-50 model on the same dataset using the same number of GPUs in 65 minutes. To the best of our knowledge, this is the best known training performance demonstrated for the Imagenet-1k dataset using 256 GPUs.