Data Augmentation Based on Vowel Stretch for Improving Children's Speech Recognition
Abstract
Prolongation is a speech disfluency in which portions of an utterance are lengthened. It is frequently observed in children's spontaneous speech but is rare in read speech. Making acoustic models more robust to children's spontaneous speech usually requires collecting a large amount of children's speech data containing prolongation, which is impractical in many cases. To tackle this problem, we propose a novel data augmentation method that generates additional data by simulating prolongation. The method inserts pseudo frames at specific positions in speech utterances, and the acoustic features of each inserted frame are calculated from the original frames on both sides of it. This design is based on our analysis showing that many vowels are stretched in children's spontaneous speech. Unlike conventional speed or tempo perturbation, which stretches or shrinks an entire utterance at a uniform rate, the proposed procedure can generate partially stretched utterances at low computational cost. The effectiveness of the proposed method was confirmed in acoustic model adaptation experiments, in which the proposed method focusing on vowel stretch showed consistent improvement over conventional speed and tempo perturbation approaches.
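As an illustration of the frame insertion described above, the following is a minimal sketch, not the authors' implementation. It assumes that the utterance is given as a NumPy array of frame-level features, that a per-frame vowel mask is available (for example from a forced alignment), and that each pseudo frame is a linear interpolation of its two neighboring frames. The abstract states only that inserted frames are calculated from the original frames on both sides, so the interpolation rule, the function name stretch_vowels, and the parameter n_insert are all assumptions made for this example.

```python
import numpy as np


def stretch_vowels(features, vowel_mask, n_insert=1):
    """Insert interpolated pseudo frames between adjacent vowel frames.

    features   : array of shape (T, D), one feature vector per frame.
    vowel_mask : boolean array of shape (T,), True where a frame is a vowel
                 (assumed to come from an external alignment).
    n_insert   : number of pseudo frames inserted between each pair of
                 consecutive vowel frames (hypothetical parameter).
    """
    features = np.asarray(features, dtype=float)
    vowel_mask = np.asarray(vowel_mask, dtype=bool)
    if features.ndim != 2:
        raise ValueError("features must have shape (T, D)")
    if vowel_mask.shape != (features.shape[0],):
        raise ValueError("vowel_mask must have shape (T,)")
    if n_insert < 0:
        raise ValueError("n_insert must be non-negative")

    num_frames = features.shape[0]
    if num_frames == 0:
        return features.copy()

    stretched = []
    for t in range(num_frames):
        stretched.append(features[t])
        # Insert pseudo frames only inside a vowel, i.e. between two
        # consecutive frames that are both labeled as vowel.
        if t + 1 < num_frames and vowel_mask[t] and vowel_mask[t + 1]:
            for k in range(1, n_insert + 1):
                w = k / (n_insert + 1)
                # Assumed rule: linear interpolation of the two neighbors.
                stretched.append((1.0 - w) * features[t] + w * features[t + 1])
    return np.stack(stretched)


if __name__ == "__main__":
    feats = np.array([[0.0, 0.0], [2.0, 4.0], [4.0, 8.0], [10.0, 10.0]])
    mask = np.array([False, True, True, False])
    augmented = stretch_vowels(feats, mask, n_insert=1)
    print(feats.shape, "->", augmented.shape)
```

Because only frames inside vowel segments are duplicated, the non-vowel parts of the utterance keep their original duration, which is the difference from uniform speed or tempo perturbation. Frame-level labels would have to be extended in the same way for the inserted frames. That step is omitted here.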