From speech to talking faces: lip movements estimation based on linear approximators

Author

Vignoli, F.

Author_Institution

Genoa Univ.

Volume

6

fYear

2000

fDate

2000

Firstpage

2381

Abstract

In human communication, speech understanding is greatly improved by the bimodal acoustic-visual effect, compared with speech alone. The benefit is particularly clear in noisy environments and for non-native speakers. In this paper, we propose a novel algorithm based on linear approximators that estimates lip movements from a timed sequence of phonemes. This sequence can be generated from real speech, by a segmentation technique based on a hidden Markov model (HMM), or from a text-to-speech system. The algorithm consists of two modules: the training module and the synthesis module. The training module is based on an eigen-analysis of an audiovisual database recorded for this purpose. The synthesis module takes the sequence of phonemes as input and implements an implicit coarticulation model. A subsequent post-processing step converts the estimated parameters into a sequence of facial animation parameters that are compliant with the new MPEG-4 standard. The algorithm has been tested with FAE (Facial Animation Engine), an MPEG-4 compliant system developed at the author's university.

Keywords

approximation theory; audio-visual systems; code standards; computer animation; eigenvalues and eigenfunctions; face recognition; hidden Markov models; learning systems; motion estimation; parameter estimation; sequences; speech intelligibility; speech synthesis; subroutines; FAE; Facial Animation Engine; MPEG-4 compliant system; audiovisual database; bimodal acoustic-visual effect; eigen-analysis; facial animation parameter sequence; hidden Markov model; human communication; implicit coarticulation model; linear approximators; lip movement estimation; noisy environments; nonnative speakers; parameter estimation; post-processing; speech segmentation technique; speech synthesis module; speech understanding; talking faces; text-to-speech system; timed phoneme sequence; training module; Audio databases; Facial animation; Hidden Markov models; Humans; Linear approximation; Loudspeakers; MPEG 4 Standard; Parameter estimation; Speech synthesis; Working environment noise;

fLanguage

English

Publisher

IEEE

Conference_Titel

Acoustics, Speech, and Signal Processing, 2000. ICASSP '00. Proceedings. 2000 IEEE International Conference on

Conference_Location

Istanbul

ISSN

1520-6149

Print_ISBN

0-7803-6293-4

Type

conf

DOI

10.1109/ICASSP.2000.859320

Filename

859320