• DocumentCode
    105435
  • Title

    A MAP-based Online Estimation Approach to Ensemble Speaker and Speaking Environment Modeling

  • Author

    Yu Tsao ; Matsuda, Shodai ; Hori, Chiori ; Kashioka, Hideki ; Chin-Hui Lee

  • Author_Institution
    Res. Center for Inf. Technol. Innovation (CITI), Acad. Sinica, Taipei, Taiwan
  • Volume
    22
  • Issue
    2
  • fYear
    2014
  • fDate
    Feb. 2014
  • Firstpage
    403
  • Lastpage
    416
  • Abstract
    An ensemble speaker and speaking environment modeling (ESSEM) approach was recently developed. The ESSEM process consists of offline and online phases. The offline phase establishes an environment structure using speech data collected under a wide range of acoustic conditions, whereas the online phase estimates a set of acoustic models that matches the testing environment based on the established environment structure. Since the estimated acoustic models accurately characterize particular testing conditions, ESSEM can improve speech recognition performance under adverse conditions. In this work, we propose two maximum a posteriori (MAP) based algorithms to improve the online estimation part of the original ESSEM framework. We first develop MAP-based environment structure adaptation to refine the original environment structure. Next, we propose to utilize the MAP criterion to estimate the mapping function of ESSEM and enhance the environment modeling capability. For the MAP estimation, three types of prior densities are derived: the clustered prior (CP), the sequential prior (SP), and the hierarchical prior (HP). Since each prior density characterizes specific acoustic knowledge, we further derive a combination mechanism to integrate the three priors. Experimental results on the Aurora-2 task verify that MAP-based online mapping function estimation enables ESSEM to achieve better performance than its maximum-likelihood (ML) based counterpart. Moreover, integrating the online environment structuring adaptation with the mapping function estimation, the proposed MAP-based ESSEM framework provides the best performance. Compared with our baseline results, MAP-based ESSEM achieves an average relative word error rate reduction of 15.53% (from 5.41% to 4.57%) under 50 testing conditions at signal-to-noise ratios (SNRs) from 0 to 20 dB over the three standardized testing sets.
  • Keywords
    acoustic signal processing; maximum likelihood estimation; speaker recognition; Aurora-2 task; CP; ESSEM process; HP; MAP-based environment structure adaptation; MAP-based online estimation approach; SNR; SP; acoustic model estimation; average word error rate reduction; clustered prior; ensemble speaker; hierarchical prior; maximum a posteriori based algorithms; offline phase; online environment structuring adaptation; online mapping function estimation; online phase; sequential prior; signal-to-noise ratio; speaking environment modeling; speech recognition performance improvement; testing conditions; Acoustics; Adaptation models; Computational modeling; Estimation; Hidden Markov models; Testing; Training; ESSEM; Ensemble speaker and speaking environment modeling; MAP; noise robustness
  • fLanguage
    English
  • Journal_Title
    Audio, Speech, and Language Processing, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    2329-9290
  • Type

    jour

  • DOI
    10.1109/TASLP.2013.2292362
  • Filename
    6671979