Title :
Learning spoken words from multisensory input
Author :
Yu, Chen ; Ballard, Dana H.
Author_Institution :
Dept. of Comput. Sci., Rochester Univ., NY, USA
Abstract :
Speech recognition and speech translation are traditionally addressed by processing acoustic signals alone, while nonlinguistic information is typically ignored. We present a new method that learns spoken words from naturally co-occurring multisensory information in a dyadic (two-person) conversation. Listeners have a strong tendency to look toward the objects a speaker refers to during conversation. In light of this, we propose to use eye gaze to integrate acoustic and visual signals and to build audio-visual lexicons of objects. With such data gathered from conversations in different languages, the spoken names of objects can be translated across languages based on their shared visual semantics. We have developed a multimodal learning system and report experimental results using speech and video, in concert with eye movement records, as training data.
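The abstract describes two steps: associating spoken names with the objects that gaze is directed at, and pairing words across languages through their shared visual referents. The following is a minimal illustrative sketch of that idea only, not the authors' system; the (word, fixated object) pairing, the data layout, and identifiers such as build_lexicon and translate are assumptions introduced here for illustration.

from collections import Counter, defaultdict

def build_lexicon(utterances):
    """Map each gaze-attended object to its most frequent spoken name.

    `utterances` is a list of (spoken_word, fixated_object) pairs, a
    stand-in for word hypotheses aligned with eye-movement records.
    """
    counts = defaultdict(Counter)
    for word, obj in utterances:
        counts[obj][word] += 1
    # Keep the dominant word form per object, a simple proxy for the
    # statistical audio-visual association learned from co-occurrence.
    return {obj: c.most_common(1)[0][0] for obj, c in counts.items()}

def translate(lexicon_a, lexicon_b):
    """Pair words across two languages that share a visual referent."""
    return {lexicon_a[obj]: lexicon_b[obj]
            for obj in lexicon_a if obj in lexicon_b}

if __name__ == "__main__":
    # Toy data: object identifiers and word forms are hypothetical.
    english = build_lexicon([("cup", "obj_cup"), ("cup", "obj_cup"),
                             ("book", "obj_book")])
    mandarin = build_lexicon([("beizi", "obj_cup"), ("shu", "obj_book")])
    print(translate(english, mandarin))  # {'cup': 'beizi', 'book': 'shu'}

In this sketch the visual referent acts as the pivot between languages, which mirrors the paper's claim that translation can be grounded in visual semantics rather than in parallel text.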
Keywords :
acoustic signal processing; eye; language translation; natural languages; speech recognition; video signal processing; acoustic signals; audio-visual lexicons; dyadic two-person conversation; eye movement records; multimodal learning system; multisensory information; multisensory input; nonlinguistic information; speech translation; spoken word learning; training data; video processing; visual semantics; visual signals; Authentication; Computer science; Humans; Learning systems; Loudspeakers; Natural languages; Pediatrics; Signal processing; Speech processing; Speech recognition;
Conference_Title :
2002 6th International Conference on Signal Processing
Print_ISBN :
0-7803-7488-6
DOI :
10.1109/ICOSP.2002.1179956