• DocumentCode
    743019
  • Title

    Single-Channel Speech-Music Separation for Robust ASR With Mixture Models

  • Author

    Demir, Cemil ; Saraclar, Murat ; Cemgil, A.T.

  • Author_Institution
    Speech & Language Technol. Lab, TUBITAK-BILGEM, Kocaeli, Turkey
  • Volume
    21
  • Issue
    4
  • fYear
    2013
  • fDate
    4/1/2013 12:00:00 AM
  • Firstpage
    725
  • Lastpage
    736
  • Abstract
    In this study, we describe a mixture-model-based single-channel speech-music separation method. Given a catalog of background music material, we propose a generative model for the superposed speech and music spectrograms. The background music signal is assumed to be generated by a jingle in the catalog. The background music component is modeled by a scaled conditional mixture model representing the jingle. The speech signal is modeled by a probabilistic model, which is similar to the probabilistic interpretation of the Non-negative Matrix Factorization (NMF) model. The parameters of the speech model are estimated in a semi-supervised manner from the mixed signal. The approach is tested with Poisson and complex Gaussian observation models that correspond respectively to Kullback-Leibler (KL) and Itakura-Saito (IS) divergence measures. Our experiments show that the proposed mixture model outperforms a standard NMF method both in speech-music separation and automatic speech recognition (ASR) tasks. These results are further improved using Markovian prior structures for temporal continuity between the jingle frames. Our test results with real data show that our method increases the speech recognition performance.
  • Keywords
    Gaussian processes; Markov processes; matrix decomposition; music; speech recognition; IS divergence measurement; Itakura-Saito divergence measurement; KL measurement; Kullback-Leibler divergence measurement; Markovian prior structures; NMF model; Poisson models; automatic speech recognition tasks; background music material; background music signal; complex Gaussian observation models; jingle representation; mixture model-based single-channel speech-music separation method; music spectrograms; nonnegative matrix factorization model; probabilistic interpretation; probabilistic model; robust ASR; scaled conditional mixture model; semisupervised manner; speech signal; superposed speech; temporal continuity; Catalogs; Data models; Hidden Markov models; Multiple signal classification; Probabilistic logic; Spectrogram; Speech; Gamma Markov chain; non-negative matrix factorization (NMF); single-channel; speech recognition; speech-music separation;
  • fLanguage
    English
  • Journal_Title
    Audio, Speech, and Language Processing, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1558-7916
  • Type

    jour

  • DOI
    10.1109/TASL.2012.2231072
  • Filename
    6365761