DocumentCode :
251562
Title :
Audio-visual keyword spotting based on adaptive decision fusion under noisy conditions for human-robot interaction
Author :
Hong Liu ; Ting Fan ; Pingping Wu
Author_Institution :
Eng. Lab. on Intell. Perception for Internet of Things (ELIP), Peking Univ., Beijing, China
fYear :
2014
fDate :
May 31 - June 7, 2014
Firstpage :
6644
Lastpage :
6651
Abstract :
Keyword spotting (KWS) deals with the identification of keywords in unconstrained speech, which is a natural, straightforward and friendly way for human-robot interaction (HRI). Most keyword spotters share a common noise-robustness problem when applied to real-world environments with dramatically changing noise. Since visual information is not affected by acoustic noise, it can be exploited to complement the audio stream and improve noise robustness. In this paper, a novel audio-visual keyword spotting approach based on adaptive decision fusion under noisy conditions is proposed. To accurately represent the appearance and movement of the mouth region, an improved local binary pattern from three orthogonal planes (ILBP-TOP) is proposed. In addition, a parallel two-step recognition over acoustic and visual keyword candidates is conducted, generating corresponding acoustic and visual scores for each keyword candidate. Optimal weights for combining the acoustic and visual contributions under diverse noise conditions are produced by a neural network based on the reliabilities of the two modalities. Experiments show that the proposed decision-fusion-based audio-visual keyword spotter significantly improves noise robustness and attains better performance than a feature-fusion-based audio-visual spotter. Additionally, ILBP-TOP shows more competitive performance than LBP-TOP.
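The abstract describes adaptive decision fusion as a weighted combination of acoustic and visual keyword scores, with the weight predicted by a neural network from the reliabilities of the two modalities. Below is a minimal sketch of that idea, not the authors' implementation: the reliability features (e.g., an SNR estimate and a visual confidence), the single-hidden-layer network, and all function names are illustrative assumptions.

```python
# Sketch of adaptive decision fusion for audio-visual keyword spotting.
# A tiny MLP maps per-utterance reliability features to an audio weight,
# and each keyword candidate is scored by a weighted sum of its acoustic
# and visual scores. All numbers and shapes here are placeholders.
import numpy as np

def mlp_weight(reliabilities, W1, b1, W2, b2):
    """Map reliability features to an audio weight in [0, 1] via a small MLP."""
    h = np.tanh(reliabilities @ W1 + b1)          # hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # sigmoid -> weight for audio

def fuse_scores(acoustic_scores, visual_scores, reliabilities, params):
    """Combine per-candidate acoustic and visual scores with an adaptive weight.

    acoustic_scores, visual_scores: dict keyword -> score (e.g., log-likelihood)
    reliabilities: 1-D array of stream-reliability features (assumed inputs)
    params: (W1, b1, W2, b2) of the weighting network
    """
    w_audio = mlp_weight(reliabilities, *params)  # scalar in [0, 1]
    fused = {
        kw: w_audio * acoustic_scores[kw] + (1.0 - w_audio) * visual_scores[kw]
        for kw in acoustic_scores
    }
    best = max(fused, key=fused.get)              # spotted keyword candidate
    return best, fused

# Toy usage with made-up scores and randomly initialized (untrained) weights.
rng = np.random.default_rng(0)
params = (rng.normal(size=(2, 8)), np.zeros(8), rng.normal(size=8), 0.0)
acoustic = {"hello": -12.3, "robot": -10.1}
visual = {"hello": -9.8, "robot": -11.5}
reliab = np.array([0.4, 0.9])  # e.g., [normalized SNR estimate, visual confidence]
print(fuse_scores(acoustic, visual, reliab, params))
```

In practice the weighting network would be trained so that low acoustic reliability (heavy noise) shifts the decision toward the visual stream; the sketch only illustrates the score-level combination described in the abstract.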
Keywords :
audio signal processing; decision theory; human-robot interaction; image fusion; image representation; neural nets; object recognition; speech recognition; ILBP-TOP; acoustic keyword candidates; acoustic scores; adaptive decision fusion; audio-visual keyword spotting approach; automatic speech recognition; human-robot interaction; improved local binary pattern; keyword identification; mouth region appearance representation; mouth region movement representation; neural network; noise-robustness problem; noisy conditions; parallel two-step recognition; three orthogonal planes; unconstrained speech; visual keyword candidates; visual scores; Acoustics; Feature extraction; Hidden Markov models; Mouth; Noise measurement; Reliability; Visualization;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
2014 IEEE International Conference on Robotics and Automation (ICRA)
Conference_Location :
Hong Kong
Type :
conf
DOI :
10.1109/ICRA.2014.6907840
Filename :
6907840