Abstract :
A new technology is proposed for audio-video synchronization in multimedia applications where talking human faces, either natural or synthetic, are employed for interpersonal communication services, home gaming, advanced multimodal interfaces, interactive entertainment, or movie production. Facial sequences represent an acoustic-visual source characterized by two strongly correlated components, a talking face and the associated speech, whose synchronous presentation must be guaranteed in any multimedia application. Therefore, the exact timing for displaying a video frame or for generating a synthetic facial image has to be supervised by some form of speech analysis, performed either as preprocessing before encoding or as postprocessing before presentation. Experimental results are reported on the use of time-delay neural networks (TDNNs) for the direct estimation of the visible articulation of the mouth from the coherent analysis of acoustic speech. The architectural solution of employing a bank of independent single-output TDNNs is compared with the alternative of using a single multi-output TDNN. Similarly, two different learning procedures are applied and compared for training the TDNNs, the first based on the classic mean square error (MSE) and the second based on a measure of cross-correlation (CC). The results show the superiority of the system based on multiple single-output TDNNs, as well as the improvements in both convergence speed and estimation fidelity achievable through the learning algorithm based on cross-correlation.
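The two training criteria and the time-delay structure named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the exact cross-correlation measure or the network dimensions, so everything here is an assumption: the CC cost is taken to be one minus the Pearson correlation between the estimated and the true articulatory trajectory, and the names `mse_loss`, `cc_loss`, and `time_delay_layer` are hypothetical, not taken from the paper.

```python
import numpy as np


def mse_loss(pred, target):
    """Classic mean square error between two trajectories."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2))


def cc_loss(pred, target, eps=1e-12):
    """Assumed CC cost: 1 - Pearson correlation (0 when shapes match exactly).

    This is an illustrative choice; the paper's actual CC measure may differ.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    p = pred - pred.mean()
    t = target - target.mean()
    cc = np.sum(p * t) / (np.sqrt(np.sum(p ** 2) * np.sum(t ** 2)) + eps)
    return float(1.0 - cc)


def time_delay_layer(x, weights, bias):
    """One time-delay layer: each output frame sees D consecutive input frames.

    x:       (T, F) sequence of acoustic feature frames
    weights: (D, F, H) weights shared across all time positions
    bias:    (H,)
    returns: (T - D + 1, H) hidden activations
    """
    D = weights.shape[0]
    T = x.shape[0]
    # Stack every sliding window of D frames: shape (T - D + 1, D, F).
    windows = np.stack([x[t:t + D] for t in range(T - D + 1)])
    return np.tanh(np.einsum("tdf,dfh->th", windows, weights) + bias)


if __name__ == "__main__":
    # A mouth-parameter trajectory and an estimate with the right shape
    # but the wrong gain and offset.
    target = np.sin(np.linspace(0.0, 2.0 * np.pi, 50))
    pred = 0.5 * target + 0.2
    print("MSE:", mse_loss(pred, target))
    print("CC loss:", cc_loss(pred, target))
```

Under this assumed definition, an estimate that follows the shape of the target but differs in gain and offset is penalized by MSE and not by the CC cost, which is one plausible reading of why a correlation-based criterion could behave differently during training.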
Keywords :
audio-visual systems; convergence of numerical methods; correlation methods; delays; image sequences; learning (artificial intelligence); least mean squares methods; multimedia communication; neural nets; parameter estimation; speech processing; synchronisation; acoustic speech; audio-video synchronization; coherent analysis; convergence speed; correlated components; cross-correlation measure; experimental results; facial sequences; home gaming; interactive entertainment; interpersonal communication services; learning algorithm; learning procedures; lip movement estimation; mean square error; movie production; multimedia applications; multimodal interfaces; multiple single-output TDNN; speech analysis; talking human faces; time-delay neural networks; training; video frame; Face; Humans; Image coding; Mean square error methods; Motion pictures; Mouth; Multimedia systems; Neural networks; Speech analysis; Timing;