  • DocumentCode
    48807
  • Title
    On Training Targets for Supervised Speech Separation
  • Author
    Yuxuan Wang; Arun Narayanan; DeLiang Wang
  • Author_Institution
    Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
  • Volume
    22
  • Issue
    12
  • fYear
    2014
  • fDate
    Dec. 2014
  • Firstpage
    1849
  • Lastpage
    1858
  • Abstract
    Formulating speech separation as a supervised learning problem has shown considerable promise. In its simplest form, a supervised learning algorithm, typically a deep neural network, is trained to learn a mapping from noisy features to a time-frequency representation of the target of interest. Traditionally, the ideal binary mask (IBM) is used as the target because of its simplicity and the large speech intelligibility gains it yields. The supervised learning framework, however, is not restricted to binary targets. In this study, we evaluate and compare separation results obtained with different training targets, including the IBM, the target binary mask, the ideal ratio mask (IRM), the short-time Fourier transform spectral magnitude and its corresponding mask (FFT-MASK), and the Gammatone frequency power spectrum. Our results in various test conditions reveal that the two ratio mask targets, the IRM and the FFT-MASK, outperform the other targets in terms of objective intelligibility and quality metrics. In addition, we find that masking-based targets are, in general, significantly better than spectral-envelope-based targets. We also present comparisons with recent methods in non-negative matrix factorization and speech enhancement, which show clear performance advantages of supervised speech separation.
  • Keywords
    Fourier transforms; learning (artificial intelligence); matrix decomposition; neural nets; source separation; speech coding; speech intelligibility; time-frequency analysis; FFT-mask; Fourier transform spectral magnitude; Gammatone frequency power spectrum; IBM; IRM; ideal binary mask; ideal ratio mask; masking based targets; neural network; nonnegative matrix factorization; spectral envelope based targets; speech enhancement; speech intelligibility gains; supervised learning algorithm; supervised learning problem; supervised speech separation; target binary mask; time-frequency representation; Noise measurement; Signal to noise ratio; Speech; Speech processing; Supervised learning; Training; Deep neural networks; speech separation; supervised learning; training targets
  • fLanguage
    English
  • Journal_Title
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • Publisher
    IEEE
  • ISSN
    2329-9290
  • Type
    jour
  • DOI
    10.1109/TASLP.2014.2352935
  • Filename
    6887314