DocumentCode
975493
Title
Speech-to-video synthesis using MPEG-4 compliant visual features
Author
Aleksic, Petar S. ; Katsaggelos, Aggelos K.
Author_Institution
Dept. of Electr. & Comput. Eng., Northwestern Univ., Evanston, IL, USA
Volume
14
Issue
5
fYear
2004
fDate
5/1/2004 12:00:00 AM
Firstpage
682
Lastpage
692
Abstract
There is a strong correlation between the building blocks of speech (phonemes) and the building blocks of visual speech (visemes). In this paper, this correlation is exploited and an approach is proposed for synthesizing the visual representation of speech from a narrow-band acoustic speech signal. The visual speech is represented in terms of the facial animation parameters (FAPs), supported by the MPEG-4 standard. The main contribution of this paper is the development of a correlation hidden Markov model (CHMM) system, which integrates independently trained acoustic HMM (AHMM) and visual HMM (VHMM) systems, in order to realize speech-to-video synthesis. The proposed CHMM system allows for different model topologies for acoustic and visual HMMs. It performs late integration and reduces the amount of required training data compared to early integration modeling techniques. Temporal accuracy experiments, comparison of the synthesized FAPs to the original FAPs, and audio-visual automatic speech recognition (AV-ASR) experiments utilizing the synthesized visual speech were performed in order to objectively measure the performance of the system. The objective experiments demonstrated that the proposed approach reduces time alignment errors by 40.5% compared to the conventional temporal scaling method, that the synthesized FAP sequences are very similar to the original FAP sequences, and that synthesized FAP sequences contain visual speechreading information that can improve AV-ASR performance.
Keywords
acoustic signal processing; audio-visual systems; computer animation; hidden Markov models; speech processing; speech recognition; speech synthesis; video signal processing; MPEG-4 compliant visual features; acoustic hidden Markov model; acoustic speech signal; audio-visual automatic speech recognition; correlation hidden Markov model system; facial animation parameters; integration modeling techniques; lip synchronization; speech-to-video synthesis; speechreading information; temporal scaling method; training data; visual hidden Markov model; visual speech; Automatic speech recognition; Facial animation; Hidden Markov models; MPEG 4 Standard; Narrowband; Signal synthesis; Speech synthesis; Topology; Training data
fLanguage
English
Journal_Title
Circuits and Systems for Video Technology, IEEE Transactions on
Publisher
IEEE
ISSN
1051-8215
Type
jour
DOI
10.1109/TCSVT.2004.826760
Filename
1294959