In this paper we present a study on building deep neural network-based speech recognition systems for automatic caption generation that can deal with out-of-vocabulary (OOV) words. We develop several systems on broadcast news using different acoustic modeling (hybrid, CTC, and attention-based neural networks) and language modeling (n-gram and RNN-based neural networks) techniques. We discuss the limitations of the proposed systems and introduce methods for using them effectively to detect OOVs. For automatic OOV recovery, we compare different kinds of phonetic and graphemic sub-word units that can be synthesized into word outputs. On an experimental three-hour broadcast news test set with a 4% OOV rate, the proposed CTC and attention-based systems detect OOVs far more reliably (0.52 F-score) than a traditional hybrid baseline system (0.21 F-score). These detection gains translate into better word error rate (WER) performance. Relative to a non-OOV oracle baseline, the proposed systems lose just 12% relative (1.4% absolute) in WER by recovering OOVs from their sub-word outputs, a significantly smaller loss than that of the traditional hybrid system (close to 50% relative).
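The two metrics quoted above can be made concrete with a minimal sketch. It assumes the standard definitions (F-score as the harmonic mean of precision and recall over detected OOVs, and relative WER loss as the absolute loss divided by the baseline WER); the function names are hypothetical and are not taken from this work. Under that assumption, a 1.4% absolute loss at 12% relative implies an oracle WER of about 11.7%, a figure derived from the rounded numbers above rather than one reported here.

```python
def f_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall for OOV detection.

    Counts are hypothetical inputs; returns 0.0 when nothing is detected correctly.
    """
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def relative_wer_loss(baseline_wer: float, system_wer: float) -> float:
    """Relative WER degradation of a system with respect to a baseline."""
    return (system_wer - baseline_wer) / baseline_wer


def implied_baseline_wer(absolute_loss: float, relative_loss: float) -> float:
    """Baseline WER implied by an absolute loss and the matching relative loss."""
    return absolute_loss / relative_loss


if __name__ == "__main__":
    # Figures quoted in the text: 1.4% absolute loss, 12% relative loss.
    # The result is an inference from rounded values, not a reported number.
    oracle = implied_baseline_wer(absolute_loss=1.4, relative_loss=0.12)
    print(f"implied oracle WER: {oracle:.1f}%")
    print(f"check: {relative_wer_loss(oracle, oracle + 1.4):.2f} relative loss")
```

Because the quoted percentages are rounded, the implied baseline should be read as an approximation only.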