Neural network (NN) based acoustic frontends, such as denoising autoencoders, are being actively investigated as a way to make NN-based acoustic models more robust to varying noise conditions. Recent work has shown that jointly training such frontends with backend NNs significantly improves speech recognition performance. In this paper, we propose an effective algorithm for jointly training a denoising feature-space transform and an NN-based acoustic model with various kinds of data. The proposed method first pretrains a Convolutional Neural Network (CNN) based denoising frontend and then jointly trains this frontend with an NN backend acoustic model. In the unsupervised pretraining stage, the frontend is trained to estimate clean log Mel-filterbank features from noisy log-power spectral input features. Subsequent multi-stage training of the proposed frontend, with dropout applied only at the joint layer between the frontend and backend NNs, yields significant improvements in overall performance. On the Aurora-4 task, the proposed system achieves an average word error rate (WER) of 9.98%, a 9.0% relative improvement over one of the best reported speaker-independent baseline systems. A final semi-supervised adaptation of the frontend NN, similar to feature-space adaptation, reduces the average WER to 7.39%, a further relative WER improvement of 25%.
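
The relative improvements quoted above follow the usual definition of relative WER reduction, (WER_before - WER_after) / WER_before. The short sketch below reproduces that arithmetic from the two average WERs given in the abstract. The function and variable names are illustrative assumptions, and the baseline WER is not stated in the abstract; the value computed here is only what the 9.0% figure implies, not a reported result.

```python
def relative_wer_reduction(wer_before: float, wer_after: float) -> float:
    """Return the relative WER reduction in percent."""
    if wer_before <= 0:
        raise ValueError("wer_before must be positive")
    return 100.0 * (wer_before - wer_after) / wer_before


# Average WERs (in percent) quoted in the abstract.
joint_wer = 9.98     # after joint frontend/backend training
adapted_wer = 7.39   # after semi-supervised frontend adaptation

# Relative gain from the adaptation step.
adaptation_gain = relative_wer_reduction(joint_wer, adapted_wer)

# Assumption: invert the stated 9.0% relative improvement to estimate the
# baseline WER. This is an implied value, not one reported in the abstract.
implied_baseline_wer = joint_wer / (1.0 - 0.09)

print(f"relative gain from adaptation: {adaptation_gain:.2f}%")
print(f"implied baseline WER: {implied_baseline_wer:.2f}%")
```

Under this definition the adaptation step gives a reduction of just under 26%, in line with the roughly 25% quoted above, and the 9.0% figure implies a baseline average WER of about 11%.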