Regional Information Center for Science and Technology (RICeST) - Time-delay neural networks for estimating lip movements from speech analysis: a useful tool in audio-video synchronization

DocumentCode :

1324160

Title :

Time-delay neural networks for estimating lip movements from speech analysis: a useful tool in audio-video synchronization

Author :

Lavagetto, Fabio

Author_Institution :

Dept. of Commun. Comput., Genoa Univ., Italy

Volume :

Issue :

Year :

1997

Date :

10/1/1997

Firstpage :

786

Lastpage :

800

Abstract :

A new technology is proposed for audio-video synchronization in multimedia applications where talking human faces, either natural or synthetic, are employed for interpersonal communication services, home gaming, advanced multimodal interfaces, interactive entertainment, or movie production. Facial sequences represent an acoustic-visual source characterized by two strongly correlated components, a talking face and the associated speech, whose synchronous presentation must be guaranteed in any multimedia application. Therefore, the exact timing for displaying a video frame or for generating a synthetic facial image has to be supervised by some form of speech analysis, performed either as preprocessing before encoding or as postprocessing before presentation. Experimental results are reported on the use of time-delay neural networks (TDNNs) for directly estimating the visible articulation of the mouth from the coherent analysis of acoustic speech. An architecture that employs a bank of independent single-output TDNNs is compared with the alternative of a single multi-output TDNN. Similarly, two learning procedures for training the TDNNs are applied and compared: the first based on the classic mean square error (MSE) and the second on a measure of cross-correlation (CC). The results show the superiority of the system based on multiple single-output TDNNs, as well as the improvements in both convergence speed and estimation fidelity achievable through the learning algorithm based on cross-correlation.
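
The two training criteria named in the abstract can be contrasted with a minimal sketch. The abstract does not give the exact cross-correlation measure or the network input layout, so everything below is an assumption made for illustration: the CC criterion is taken to be a normalized (Pearson-style) correlation between target and estimated mouth-parameter trajectories, and the hypothetical helpers `delay_windows`, `mse`, and `normalized_cc` are not names from the paper.

```python
import numpy as np


def delay_windows(frames, n_delays):
    """Stack each frame with its n_delays predecessors.

    Illustrates the time-delay input of a TDNN: row k holds the
    concatenated features of frames k .. k + n_delays (assumed layout).
    """
    frames = np.asarray(frames, dtype=float)
    n_frames = frames.shape[0]
    return np.stack(
        [frames[t - n_delays:t + 1].ravel() for t in range(n_delays, n_frames)]
    )


def mse(target, estimate):
    """Classic mean square error between two trajectories."""
    target = np.asarray(target, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    return float(np.mean((target - estimate) ** 2))


def normalized_cc(target, estimate, eps=1e-12):
    """Normalized cross-correlation (assumed form, not the paper's definition).

    Insensitive to gain and offset, so it rewards matching the shape of
    the trajectory rather than its absolute values.
    """
    t = np.asarray(target, dtype=float)
    e = np.asarray(estimate, dtype=float)
    t = t - t.mean()
    e = e - e.mean()
    return float(np.sum(t * e) / (np.sqrt(np.sum(t ** 2) * np.sum(e ** 2)) + eps))


# Synthetic stand-in for one mouth parameter (e.g. lip opening) over time.
time_axis = np.linspace(0.0, 1.0, 200)
target = np.sin(2.0 * np.pi * 3.0 * time_axis)
estimate = 0.5 * target + 0.3  # right shape, wrong gain and offset

mse_value = mse(target, estimate)
cc_value = normalized_cc(target, estimate)
windows = delay_windows(np.zeros((10, 3)), n_delays=2)
```

In this toy case the estimate follows the target's shape exactly, so the assumed CC measure is close to 1 while the MSE stays well above zero. That difference is one plausible reason a correlation-based criterion might behave differently during training, though the abstract itself does not state the mechanism.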

Keywords :

audio-visual systems; convergence of numerical methods; correlation methods; delays; image sequences; learning (artificial intelligence); least mean squares methods; multimedia communication; neural nets; parameter estimation; speech processing; synchronisation; acoustic speech; audio-video synchronization; coherent analysis; convergence speed; correlated components; cross-correlation measure; experimental results; facial sequences; home gaming; interactive entertainment; interpersonal communication services; learning algorithm; learning procedures; lip movement estimation; mean square error; movie production; multimedia applications; multimodal interfaces; multiple single-output TDNN; speech analysis; talking human faces; time-delay neural networks; training; video frame; Face; Humans; Image coding; Mean square error methods; Motion pictures; Mouth; Multimedia systems; Neural networks; Speech analysis; Timing;

Language :

English

Journal_Title :

Circuits and Systems for Video Technology, IEEE Transactions on

Publisher :

IEEE

ISSN :

1051-8215

Type :

jour

DOI :

10.1109/76.633499

Filename :

633499

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1324160