Regional Information Center for Science and Technology - A bidirectional target-filtering model of speech coarticulation and reduction: two-stage implementation for phonetic recognition

DocumentCode :

763654

Title :

A bidirectional target-filtering model of speech coarticulation and reduction: two-stage implementation for phonetic recognition

Author :

Deng, Li ; Yu, Dong ; Acero, Alex

Author_Institution :

Microsoft Res., Redmond, WA, USA

Volume :

Issue :

fYear :

2006

Firstpage :

256

Lastpage :

265

Abstract :

A structured generative model of speech coarticulation and reduction is described with a novel two-stage implementation. At the first stage, the dynamics of formants or vocal tract resonances (VTRs) in fluent speech are generated using prior information about the resonance targets in the phone sequence, in the absence of acoustic data. Bidirectional temporal filtering with a finite-impulse response (FIR) filter is applied to the segmental target sequence as the filter's input, where forward filtering produces anticipatory coarticulation and backward filtering produces regressive coarticulation. The filtering process is also shown to result in realistic resonance-frequency undershooting or reduction for fast-rate and low-effort speech in a contextually assimilated manner. At the second stage, the dynamics of speech cepstra are predicted analytically based on the FIR-filtered and speaker-adapted VTR targets, and the prediction residuals are modeled by Gaussian random variables with trainable parameters. The combined system of these two stages thus generates correlated and causally related VTR and cepstral dynamics, where phonetic reduction is represented explicitly in the hidden resonance space and implicitly in the observed cepstral space. We present details of model simulation demonstrating quantitative effects of speaking rate and segment duration on the magnitude of reduction, agreeing closely with experimental measurement results in the acoustic-phonetic literature. This two-stage model is implemented and applied to the TIMIT phonetic recognition task. Using the N-best (N=2000) rescoring paradigm, the new model, which contains only context-independent parameters, is shown to significantly reduce the phone error rate of a standard hidden Markov model (HMM) system under the same experimental conditions.
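
The first stage described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the filter shape, so the code assumes a symmetric, exponentially decaying FIR kernel, and the function names, the stiffness value `gamma`, the kernel half-width, and the target frequencies are all hypothetical. It applies the kernel to a stepwise target sequence and compares the peak reached by a long and a short middle segment, which is one way to see duration-dependent undershoot.

```python
def symmetric_fir_kernel(gamma, half_width):
    """Two-sided kernel with weights proportional to gamma**abs(k), k = -D..D.

    The exponential shape is an assumption made for this sketch.
    Weights are normalized so that they sum to 1.
    """
    raw = [gamma ** abs(k) for k in range(-half_width, half_width + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def expand_segments(segments):
    """Turn (target_hz, duration_frames) pairs into a stepwise frame sequence."""
    frames = []
    for target_hz, duration in segments:
        frames.extend([float(target_hz)] * duration)
    return frames


def filter_targets(targets, kernel):
    """Apply the two-sided kernel to every frame of the target sequence.

    Each output frame mixes targets from frames on both sides of it,
    so neighboring segments influence each other in both directions.
    Edges are handled by repeating the first and last target values.
    """
    half_width = len(kernel) // 2
    last = len(targets) - 1
    smoothed = []
    for t in range(len(targets)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = min(max(t + j - half_width, 0), last)
            acc += weight * targets[idx]
        smoothed.append(acc)
    return smoothed


def peak_of_middle_segment(middle_duration, gamma=0.6, half_width=7):
    """Peak of the filtered trajectory for a low-high-low target pattern."""
    # Hypothetical resonance targets in Hz; only the middle duration varies.
    segments = [(1000, 20), (2000, middle_duration), (1000, 20)]
    kernel = symmetric_fir_kernel(gamma, half_width)
    return max(filter_targets(expand_segments(segments), kernel))


long_peak = peak_of_middle_segment(20)   # segment longer than the kernel span
short_peak = peak_of_middle_segment(4)   # segment shorter than the kernel span
print(f"long segment peak:  {long_peak:.1f} Hz")
print(f"short segment peak: {short_peak:.1f} Hz")
```

Under these assumptions, the long segment reaches its target because the kernel fits entirely inside it, while the short segment falls short of 2000 Hz because the kernel always overlaps the neighboring low targets. Shortening the segment plays the role of a faster speaking rate in this toy setting.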

Keywords :

FIR filters; Gaussian processes; cepstral analysis; filtering theory; hidden Markov models; speech processing; speech recognition; Gaussian random variables; anticipatory coarticulation; backward filtering; bidirectional target-filtering model; bidirectional temporal filtering; finite-impulse response filter; forward filtering; hidden Markov model system; phone sequence; phonetic recognition; phonetic reduction; regressive coarticulation; resonance-frequency undershooting; speaking rate; speech cepstra dynamics; speech coarticulation; speech reduction; vocal tract resonances; Cepstral analysis; Filtering; Finite impulse response filter; Hidden Markov models; Resonance; Speech analysis; Speech processing; Speech recognition; Target recognition; Video recording; Cepstral dynamics; TIMIT; contextual assimilation; filtering of targets; formant dynamics; long-span context dependence; phonetic recognition; phonetic reduction; resonances;

fLanguage :

English

Journal_Title :

Audio, Speech, and Language Processing, IEEE Transactions on

Publisher :

IEEE

ISSN :

1558-7916

Type :

jour

DOI :

10.1109/TSA.2005.854107

Filename :

1561282

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=763654