This paper investigates a series of progressively more complex similarity measures for vocabulary-independent search in phone-based audio transcripts. English audio data is segmented and decoded to produce a sequence of phones representing the data. These sequences are then parsed into N-grams, which are used to index the data. The audio segments define the documents to be retrieved and are thus localized in time. Search is performed by expanding text-based queries into phone sequences and N-grams, then matching these against the index. The baseline similarity measure combines elements found in the literature, using edit distance with a phonetic confusion matrix to determine the similarity of query and index N-grams; it achieves performance comparable to other approaches in the literature. Extensions to the baseline are developed using a constrained form of the similarity measure together with the ability to account for higher-order confusions, namely of phone bi-grams and tri-grams. Results show improved performance across a variety of system configurations. We then generalize further, using the framework of conditional random fields (CRFs) to model confusions. Whereas others in the literature have used CRFs to model the parameters of an edit distance that incorporates deletions, substitutions, and insertions, our approach uses CRFs to model context-dependent phone-level confusions directly. The CRF is trained on parallel phonetic transcripts, providing a general framework for modeling the errors a recognition system may make while taking contextual effects into consideration. Results obtained on both in-vocabulary and out-of-vocabulary (OOV) search tasks improve, most notably for OOV, which shows a 5%-6% relative improvement. Finally, we investigate the degree to which the information captured in the three approaches is complementary and show that system combination can further improve performance. © 2012 IEEE.
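The baseline measure described above can be sketched as a weighted edit distance between query and index phone N-grams, where substitution costs are derived from a phonetic confusion matrix. The following is a minimal illustrative sketch, not the paper's implementation; the function names, cost scheme, and toy confusion values are all assumptions made for exposition.

```python
def confusion_cost(a, b, confusion):
    """Substitution cost: 0 for a match, lower for confusable phone pairs.

    `confusion[(a, b)]` is assumed to approximate P(recognizer outputs b |
    phone a was spoken); more confusable pairs get cheaper substitutions.
    """
    if a == b:
        return 0.0
    return 1.0 - confusion.get((a, b), 0.0)

def weighted_edit_distance(query_ngram, index_ngram, confusion,
                           ins_cost=1.0, del_cost=1.0):
    """Dynamic-programming edit distance with confusion-based substitutions."""
    m, n = len(query_ngram), len(index_ngram)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * del_cost
    for j in range(1, n + 1):
        d[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + del_cost,        # delete a query phone
                d[i][j - 1] + ins_cost,        # insert an index phone
                d[i - 1][j - 1] + confusion_cost(
                    query_ngram[i - 1], index_ngram[j - 1], confusion),
            )
    return d[m][n]

# Toy confusion matrix: /t/ and /d/ are often confused by the recognizer.
confusion = {("t", "d"): 0.4, ("d", "t"): 0.4}
print(weighted_edit_distance(("k", "ae", "t"), ("k", "ae", "d"), confusion))
# → 0.6 (one confusable substitution, t→d, at cost 1 - 0.4)
```

A small distance (equivalently, a high similarity after an appropriate transform) lets a query N-gram match index N-grams that differ only by likely recognition errors, which is what makes the search robust to decoder mistakes.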