Title : 
Video tracking through occlusions by fast audio source localisation
         
        
            Author : 
D'Arca, Eleonora ; Hughes, Ashley ; Robertson, Neil M. ; Hopgood, James
         
        
            Author_Institution : 
Joint Res. Inst. for Signal & Image Process., Heriot-Watt Univ. & Univ. of Edinburgh, Edinburgh, UK
         
        
        
        
        
        
            Abstract : 
In this paper we present a novel audio-visual speaker detection and localisation algorithm. Audio source position estimates are computed by a novel stochastic region contraction (SRC) audio search algorithm for accurate speaker localisation. The search is aided by available video information, which provides head-height estimates over the whole scene; this variant, stochastic region contraction with height estimation (SRC-HE), gives a speed improvement of 56% over SRC. Finally, we combine audio and video data in a Kalman filter (KF), which fuses person-position likelihoods and tracks the speaker. Our system comprises a single video camera and 16 microphones. We validate the approach on the problem of video occlusion, i.e. two people having a conversation must be detected and localised at a distance (as in surveillance scenarios, rather than enclosed meeting rooms). We show that video occlusion can be resolved and that speakers can be correctly detected and localised in real data. Moreover, joint audio-video (AV) speaker tracking based on SRC-HE outperforms tracking based on the original SRC by 16% and 4% in terms of multiple object tracking precision (MOTP) and multiple object tracking accuracy (MOTA), respectively. Speaker change detection improves by 11% over SRC.
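
The fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the state model, the noise levels, or the form of the person-position likelihoods, so everything below is an assumption made for illustration: a 2-D constant-velocity state, position-only measurements from each modality, hand-picked covariances, and hypothetical function names (`kf_predict`, `kf_update`, `track`). It is not the authors' implementation. A missing video measurement (`None`) stands in for an occlusion, during which only the audio estimate updates the track.

```python
import numpy as np

def kf_predict(x, P, F, Q):
    """Propagate the state mean and covariance one time step."""
    return F @ x, F @ P @ F.T + Q

def kf_update(x, P, z, H, R):
    """Standard Kalman measurement update for one observation z."""
    y = z - H @ x                      # innovation
    S = H @ P @ H.T + R                # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)     # Kalman gain
    x = x + K @ y
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P

def track(video_obs, audio_obs, dt=1.0):
    """Fuse per-frame video and audio positions; None marks a missing one."""
    # Assumed constant-velocity model with state [px, py, vx, vy].
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)   # both sensors observe position
    Q = 0.01 * np.eye(4)                         # assumed process noise
    R_video = 0.05 * np.eye(2)                   # assumed: video is more precise
    R_audio = 0.5 * np.eye(2)                    # assumed: audio is noisier
    x, P = np.zeros(4), 10.0 * np.eye(4)         # vague initial belief
    estimates = []
    for zv, za in zip(video_obs, audio_obs):
        x, P = kf_predict(x, P, F, Q)
        # Sequential updates: each available modality refines the estimate.
        if zv is not None:
            x, P = kf_update(x, P, np.asarray(zv, dtype=float), H, R_video)
        if za is not None:
            x, P = kf_update(x, P, np.asarray(za, dtype=float), H, R_audio)
        estimates.append(x[:2].copy())
    return np.array(estimates)

if __name__ == "__main__":
    # Speaker walks along x; video is lost (occluded) from frame 5 onwards.
    truth = [(float(t), 2.0) for t in range(10)]
    video = [p if t < 5 else None for t, p in enumerate(truth)]
    print(track(video, truth)[-1])
```

Applying the two measurements one after the other is equivalent to a single joint update when their noises are independent, which is the assumption made here; it also makes it easy to skip a modality when it is unavailable.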
         
        
            Keywords : 
Kalman filters; audio-visual systems; microphone arrays; natural scenes; object tracking; search problems; speaker recognition; stochastic processes; video surveillance; KF; Kalman filter; MOTA; MOTP; SRC audio search algorithm; SRC-HE based joint audio-video speaker tracking; audio data; audio source position estimation; audio-source localisation; audio-visual speaker detection algorithm; audio-visual speaker localisation algorithm; enclosed meeting rooms; head height estimation; height estimation; microphones; multiobject tracking accuracy; multiobject tracking precision; person-position likelihood fusion; speaker change detection improvement; speaker tracking; stochastic region contraction audio search algorithm; surveillance scenarios; video camera; video data; video information; video occlusion problem; video tracking; Multimodal tracking; Optimization methods; Sampling Methods; Speaker Tracking; Video Tracking;
         
        
        
        
            Conference_Title : 
Image Processing (ICIP), 2013 20th IEEE International Conference on
         
        
            Conference_Location : 
Melbourne, VIC
         
        
        
            DOI : 
10.1109/ICIP.2013.6738548