Title :
On Using Multiple Models for Automatic Speech Segmentation
Author :
Park, Seung Seop; Kim, Nam Soo
Author_Institution :
Sch. of Electr. Eng. & INMC, Seoul Nat. Univ., Seoul
Abstract :
In this paper, we propose a novel approach to automatic speech segmentation for unit-selection-based text-to-speech systems. Instead of using a single automatic segmentation machine (ASM), we use multiple independent ASMs to produce a final boundary time-mark. Specifically, given multiple boundary time-marks provided by separate ASMs, we first compensate each time-mark for the potential ASM-specific, context-dependent systematic error (or bias) and then compute a weighted sum of the bias-removed time-marks, yielding the final time-mark. The bias and weight parameters required by the proposed method are obtained beforehand for each phonetic context (e.g., /p/-/a/) through a training procedure that uses manual segmentations as references. For training, we first define a cost function that quantifies the discrepancy (or error) between the automatic and manual segmentations and then minimize the sum of costs with respect to the bias and weight parameters. When the squared error is used as the cost, the bias parameters are easily obtained by averaging the errors within each phonetic context; then, with the bias parameters fixed, the weight parameters are jointly optimized through a gradient projection method, which is adopted to handle the set of constraints imposed on the weight parameter space. A decision tree that clusters all the phonetic contexts is used to deal with unseen phonetic contexts. Our experimental results indicate that the proposed method improves the percentage of boundaries that deviate by less than 20 ms from the reference boundary from 95.06% with an HMM-based procedure and 96.85% with a previous multiple-model-based procedure to 97.07%.
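The training and fusion steps described in the abstract can be illustrated with a minimal Python sketch for a single phonetic context. It follows the abstract where it is explicit (per-ASM bias = mean error under a squared-error cost; final time-mark = weighted sum of bias-removed time-marks; weights found by gradient projection; accuracy = share of boundaries within 20 ms). Everything else is an assumption for illustration: the abstract only mentions "a set of constraints" on the weights, so the sketch assumes they are nonnegative and sum to one and projects onto that simplex with a standard sort-based algorithm; the step-size rule, all function names, and the toy data are hypothetical, and the decision-tree clustering for unseen contexts is not shown.

```python
from typing import List, Sequence

# Hypothetical data layout (an assumption, not taken from the paper):
# auto[n][k] = time-mark in ms from ASM k for training boundary n,
# ref[n]     = manual (reference) time-mark in ms for boundary n.
# All boundaries here belong to ONE phonetic context, e.g. /p/-/a/.


def estimate_biases(auto: Sequence[Sequence[float]], ref: Sequence[float]) -> List[float]:
    """Bias of each ASM: the mean of its errors (automatic minus manual)."""
    n_items, n_asms = len(ref), len(auto[0])
    return [sum(auto[n][k] - ref[n] for n in range(n_items)) / n_items
            for k in range(n_asms)]


def project_to_simplex(v: Sequence[float]) -> List[float]:
    """Euclidean projection onto {w : w_k >= 0, sum_k w_k = 1} (assumed constraint set)."""
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for i, u_i in enumerate(u, start=1):
        cumsum += u_i
        t = (cumsum - 1.0) / i
        if u_i - t > 0:
            theta = t  # keep the threshold of the largest index passing the test
    return [max(x - theta, 0.0) for x in v]


def optimize_weights(auto, ref, biases, iters: int = 500) -> List[float]:
    """Projected gradient descent on J(w) = sum_n (sum_k w_k d_nk)^2 with biases fixed."""
    n_asms = len(biases)
    # d[n][k]: bias-removed error of ASM k on boundary n.
    d = [[row[k] - biases[k] - r for k in range(n_asms)] for row, r in zip(auto, ref)]
    total = sum(x * x for row in d for x in row)
    w = [1.0 / n_asms] * n_asms
    if total == 0.0:
        return w  # every ASM is already exact after bias removal
    # Step 1/(2*trace(D^T D)) <= 1/L, where L is the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * total)
    for _ in range(iters):
        grad = [0.0] * n_asms
        for row in d:
            residual = sum(wk * dk for wk, dk in zip(w, row))
            for k in range(n_asms):
                grad[k] += 2.0 * residual * row[k]
        # Gradient step, then project back onto the feasible weight set.
        w = project_to_simplex([wk - step * gk for wk, gk in zip(w, grad)])
    return w


def fuse(marks: Sequence[float], biases: Sequence[float], weights: Sequence[float]) -> float:
    """Final time-mark: weighted sum of the bias-removed ASM time-marks."""
    return sum(w * (t - b) for t, b, w in zip(marks, biases, weights))


def percent_within(pred: Sequence[float], ref: Sequence[float], tol_ms: float = 20.0) -> float:
    """Percentage of boundaries deviating less than tol_ms from the reference."""
    hits = sum(1 for p, r in zip(pred, ref) if abs(p - r) < tol_ms)
    return 100.0 * hits / len(ref)


if __name__ == "__main__":
    ref = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]
    noise = [3.0, -3.0, 4.0, -4.0, 1.0, -1.0]
    # Toy data: ASM 0 is late by a constant 10 ms; ASM 1 is early by 5 ms plus noise.
    auto = [[r + 10.0, r - 5.0 + e] for r, e in zip(ref, noise)]
    b = estimate_biases(auto, ref)
    w = optimize_weights(auto, ref, b)
    fused = [fuse(row, b, w) for row in auto]
    print("biases:", b, "weights:", w, "within 20 ms:", percent_within(fused, ref))
```

On this toy data the first ASM has no residual error once its 10 ms bias is removed, so the projected-gradient iterations shift almost all of the weight onto it. Because the weights are kept on the simplex, the weighted sum of bias-removed errors equals the error of the fused time-mark, which is what makes the two-stage (bias first, then weights) procedure consistent.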
Keywords :
hidden Markov models; learning (artificial intelligence); speech processing; speech recognition; speech synthesis; HMM-based procedure; automatic speech recognition; automatic speech segmentation machine; multiple boundary time-mark; multiple-model based procedure; phonetic context; training procedure; unit-selection based text-to-speech system; Acoustical engineering; Character generation; Concatenated codes; Constraint optimization; Cost function; Decision trees; Degradation; Government; Signal processing; Speech synthesis; Automatic speech segmentation; speech synthesis; unit selection;
Journal_Title :
IEEE Transactions on Audio, Speech, and Language Processing
DOI :
10.1109/TASL.2007.903933