Title : 
Active audio-visual integration for Voice Activity Detection based on a Causal Bayesian Network
         
        
            Author : 
Yoshida, Takafumi ; Nakadai, Kazuhiro
         
        
            Author_Institution : 
Grad. Sch. of Inf. Sci. & Eng., Tokyo Inst. of Technol., Tokyo, Japan
         
        
        
            fDate : 
Nov. 29 2012-Dec. 1 2012
         
        
        
        
            Abstract : 
This paper presents an active audio-visual integration framework that combines audio and visual information with a robot's active motion for noise-robust Voice Activity Detection (VAD). VAD is crucial for noise-robust Automatic Speech Recognition (ASR) because speech captured by a robot's microphones is usually contaminated by other noise sources. To realize such noise-robust VAD, we propose an Active Audio-Visual (AAV) integration framework that integrates auditory, visual, and motion information using a Causal Bayesian Network (CBN). A CBN is a subclass of Bayesian networks that can estimate the effect of active motions on VAD performance. Since the CBN is a general framework for information integration, various types of information that affect VAD performance, such as the location of a speaker and of a noise source, can be introduced naturally, and the CBN selects the optimal active motion for better robot perception using its “intervention” mechanism. We implemented a prototype system based on the proposed framework on a humanoid robot called Hearbo. The proposed AAV-VAD is compared with three types of AV-VAD: simple AAV-VAD, multi-regression-based AAV-VAD, and stationary (not active) AV-VAD. A preliminary experiment using the prototype system showed that the VAD performance of the proposed AAV-VAD was 14.4, 26.0, and 30.3 points higher than that of the simple active, multi-regression-based active, and stationary AV-VAD, respectively.
         
        
            Keywords : 
belief networks; human-robot interaction; humanoid robots; robot vision; speech recognition; AAV integration framework; ASR; CBN; Hearbo humanoid robot; VAD; active audio-visual integration; active audio-visual integration framework; auditory information; automatic speech recognition; causal Bayesian network; motion information; multiregression-based active AV-VAD; noise-robust VAD; noise-robust voice activity detection; simple active AV-VAD; stationary AV-VAD; visual information; voice activity detection; Microphones; Robots; Robustness; Speech;
         
        
        
        
            Conference_Title : 
Humanoid Robots (Humanoids), 2012 12th IEEE-RAS International Conference on
         
        
            Conference_Location : 
Osaka
         
        
        
        
            DOI : 
10.1109/HUMANOIDS.2012.6651546