Title :
Speech Enhancement and Recognition in Meetings With an Audio–Visual Sensor Array
Author :
Maganti, Hari Krishna ; Gatica-Perez, Daniel ; McCowan, Iain
Author_Institution :
Inst. of Neural Inf. Process., Univ. of Ulm, Ulm
Abstract :
This paper addresses the problem of distant speech acquisition in multiparty meetings, using multiple microphones and cameras. Microphone array beamforming techniques present a potential alternative to close-talking microphones by providing speech enhancement through spatial filtering. Beamforming techniques, however, rely on knowledge of the speaker location. In this paper, we present an integrated approach, in which an audio-visual multiperson tracker is used to track active speakers with high accuracy. Speech enhancement is then achieved using microphone array beamforming followed by a novel postfiltering stage. Finally, speech recognition is performed to evaluate the quality of the enhanced speech signal. The approach is evaluated on data recorded in a real meeting room for stationary-speaker, moving-speaker, and overlapping-speech scenarios. The results show that the speech enhancement and recognition performance achieved using our approach is significantly better than that of a single table-top microphone and is comparable to that of a lapel microphone for some of the scenarios. The results also indicate that the audio-visual system performs significantly better than an audio-only system, in terms of both enhancement and recognition. This indicates that the accurate speaker tracking provided by the audio-visual sensor array improves recognition performance in a microphone array-based speech recognition system.
Keywords :
array signal processing; audio signal processing; audio-visual systems; filtering theory; microphone arrays; speaker recognition; speech enhancement; tracking filters; audio-visual multiperson tracker; audio-visual sensor array; distant speech acquisition problem; microphone array beamforming techniques; multiparty meetings; postfiltering stage; speech recognition; Cameras; Filtering; Performance evaluation; Sensor arrays; Sensor systems; Speech analysis; Audio–visual fusion; microphone array processing; multiobject tracking
Journal_Title :
Audio, Speech, and Language Processing, IEEE Transactions on
DOI :
10.1109/TASL.2007.906197