  • DocumentCode
    48683
  • Title
    Video-Aided Model-Based Source Separation in Real Reverberant Rooms
  • Author
    Khan, M. S.; Naqvi, Syed Mohsen; Ur-Rehman, Ata; Wang, Wenwu; Chambers, Jonathon
  • Author_Institution
    Adv. Signal Process. Group, Loughborough Univ., Loughborough, UK
  • Volume
    21
  • Issue
    9
  • fYear
    2013
  • fDate
    Sept. 2013
  • Firstpage
    1900
  • Lastpage
    1912
  • Abstract
    Source separation algorithms that utilize only audio data can perform poorly when multiple sources or reverberation are present. In this paper, we therefore propose a video-aided model-based source separation algorithm for two-channel reverberant recordings in which the sources are assumed to be static. By exploiting cues from video, we first localize the individual speech sources in the enclosure and then estimate their directions. The interaural spatial cues, namely the interaural phase difference and the interaural level difference, as well as the mixing vectors, are probabilistically modeled. The models make use of the source direction information and are evaluated at discrete time-frequency points. The model parameters are refined with the well-known expectation-maximization (EM) algorithm. The algorithm outputs time-frequency masks that are used to reconstruct the individual sources. Simulation results show that, by utilizing the visual modality, the proposed algorithm can produce better time-frequency masks and thereby improved source estimates. We evaluate the proposed algorithm experimentally in different scenarios and compare it with other audio-only and audio-visual algorithms, achieving improved performance on both synthetic and real data. We also include dereverberation-based pre-processing in our algorithm to suppress the late reverberant components of the observed stereo mixture and further enhance the overall output of the algorithm. This advantage makes our algorithm a suitable candidate for use in under-determined, highly reverberant settings where the performance of other audio-only and audio-visual methods is limited.
  • Keywords
    audio signal processing; blind source separation; expectation-maximisation algorithm; image enhancement; speech intelligibility; video signal processing; EM algorithm; a two-channel reverberant recording; audio data; audio-only algorithms; audio-visual algorithms; dereverberation based preprocessing; direction estimation; discrete time-frequency points; expectation-maximization algorithm; individual source reconstruction; interaural level difference; interaural phase difference; interaural spatial cues; localize individual speech sources; mixing vectors; real reverberant rooms; reverberant component suppression; source direction information; stereo mixture; time-frequency masks; under-determined highly reverberant settings; video-aided model-based source separation; visual modality; Expectation-maximization; reverberation; source separation; spatial cues; time-frequency masking
  • fLanguage
    English
  • Journal_Title
    IEEE Transactions on Audio, Speech, and Language Processing
  • Publisher
    IEEE
  • ISSN
    1558-7916
  • Type
    jour
  • DOI
    10.1109/TASL.2013.2261814
  • Filename
    6514058