ACL-Findings 2022
Conference paper

Partitioned Gradient Matching based Data Subset Selection for Compute-Efficient & Robust ASR Training


Training state-of-the-art ASR systems such as RNN-T often has a high associated financial and environmental cost. Training with a subset of the training data could mitigate this problem if the selected subset could achieve performance on par with training on the entire dataset. Although there are many data subset selection (DSS) algorithms, applying them directly to RNN-T is difficult, especially for adaptive DSS algorithms that use learning dynamics such as gradients, since RNN-T tends to have gradients with a significantly larger memory footprint. In this paper we propose Partitioned Gradient Matching (PGM), a novel distributable DSS algorithm suitable for massive datasets like those used to train RNN-T. Through extensive experiments on Librispeech 100H and Librispeech 960H, we show that PGM achieves a 3× to 6× speedup with only a very small accuracy degradation (under 1% absolute WER difference). In addition, we demonstrate similar results for PGM even in settings where the training data is corrupted with noise.