Data-driven segment preselection in the IBM trainable speech synthesis system
Abstract
Unit-selection-based concatenative speech synthesis has proven to be a successful method of producing high-quality speech output. However, high-quality output requires large speech databases. For some applications, such databases are impractical because of their storage requirements and the complexity of the database search. In this paper, we propose a data-driven algorithm to reduce the size of the database used in concatenative synthesis. The algorithm preselects database speech segments based on statistics collected by synthesizing a large number of sentences with the full speech database. The algorithm is applied to the IBM trainable speech synthesis system, and the results show that database size can be reduced substantially while maintaining output speech quality.
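
As an illustration of the preselection idea only, the following minimal sketch counts how often each segment is chosen when a set of sentences is synthesized with the full database, and keeps the segments whose usage reaches a threshold. The abstract does not specify the statistics or the retention criterion, so the usage count, the `min_count` threshold, the function name `preselect_segments`, and the toy segment IDs are all assumptions made for this example, not the system's actual procedure.

```python
from collections import Counter
from typing import Hashable, Iterable, Set


def preselect_segments(
    selections: Iterable[Iterable[Hashable]],
    min_count: int = 1,
) -> Set[Hashable]:
    """Return the IDs of segments chosen at least `min_count` times.

    `selections` is a hypothetical log with one entry per synthesized
    sentence; each entry lists the segment IDs the full-database
    search picked for that sentence.
    """
    usage = Counter()
    for sentence in selections:
        usage.update(sentence)  # accumulate per-segment usage counts
    # Assumed retention rule: keep segments used often enough.
    return {seg for seg, n in usage.items() if n >= min_count}


# Toy log: each inner list holds the segment IDs picked for one sentence.
log = [["a1", "b7", "c2"], ["a1", "b3", "c2"], ["a1", "b7", "c9"]]
kept = preselect_segments(log, min_count=2)
```

In this toy log, segments `b3` and `c9` are each selected only once and are dropped, so the reduced database holds three of the five segments. A count threshold is just one possible criterion; any rule derived from the collected statistics could be substituted in the final set comprehension.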