• DocumentCode
    739682
  • Title
    Morphologically Filtered Power-Normalized Cochleograms as Robust, Biologically Inspired Features for ASR
  • Author
    de-la-Calle-Silos, Fernando ; Valverde-Albacete, Francisco J. ; Gallardo-Antolin, Ascension ; Pelaez-Moreno, Carmen
  • Author_Institution
    Dept. of Signal Theor. & Commun., Univ. Carlos III de Madrid, Leganés, Spain
  • Volume
    23
  • Issue
    11
  • fYear
    2015
  • Firstpage
    2070
  • Lastpage
    2080
  • Abstract
    In this paper, we present advances in the modeling of the masking behavior of the human auditory system (HAS) to enhance the robustness of the feature extraction stage in automatic speech recognition (ASR). The solution adopted is based on a nonlinear filtering of a spectro-temporal representation, applied simultaneously to both the frequency and time domains (as if it were an image) using mathematical morphology operations. A particularly important component of this architecture is the so-called structuring element (SE), which in the present contribution is designed as a single three-dimensional pattern using physiological facts, so that it closely resembles the masking phenomena taking place in the cochlea. A proper choice of spectro-temporal representation lends validity to the model throughout the whole frequency spectrum and intensity spans, assuming the variability of the masking properties of the HAS in these two domains. The best results were achieved with the representation introduced as part of the power normalized cepstral coefficients (PNCC) together with a spectral subtraction step. This method has been tested on the Aurora 2, Wall Street Journal and ISOLET databases, with both classical hidden Markov model (HMM) and hybrid artificial neural network (ANN)-HMM back-ends. In these, the proposed front-end analysis provides substantial and significant improvements compared to baseline techniques: up to 39.5% relative improvement compared to MFCC, and 18.7% compared to PNCC, on the Aurora 2 database.
  • Keywords
    cepstral analysis; ear; feature extraction; frequency-domain analysis; hidden Markov models; mathematical morphology; neural nets; nonlinear filters; speech recognition; time-domain analysis; ASR; Aurora 2 database; HAS; ISOLET databases; Wall Street Journal; automatic speech recognition; classical hidden Markov model; feature extraction stage; frequency domain; frequency spectrum; front-end analysis; human auditory system; hybrid artificial neural networks-HMM back-ends; intensity spans; masking behavior; masking phenomena; masking properties; mathematical morphology operations; morphologically filtered power-normalized cochleograms; nonlinear filtering; physiological facts; power normalized cepstral coefficients; single three-dimensional pattern; spectral subtraction step; spectro-temporal representation; structuring element; time domain; Databases; Feature extraction; Hidden Markov models; IEEE transactions; Psychoacoustic models; Speech; Speech processing; Auditory-based features; automatic speech recognition (ASR); cochlear masking models; morphological filtering; power normalized cepstral coefficients (PNCC); spectro-temporal processing;
  • fLanguage
    English
  • Journal_Title
    Audio, Speech, and Language Processing, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    2329-9290
  • Type
    jour
  • DOI
    10.1109/TASLP.2015.2464691
  • Filename
    7180326
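
The abstract describes filtering a spectro-temporal representation "as if it were an image" with mathematical morphology and a non-flat structuring element (SE). As a rough illustration of that general idea only, the sketch below applies a grayscale morphological opening to a small time-frequency array. The abstract does not say which morphological operation the paper uses, so the choice of opening is an assumption, and the 3x3 SE values are hypothetical and are not the physiologically derived SE of the paper. The function names grey_erode, grey_dilate and grey_open are illustrative.

```python
import numpy as np

def grey_erode(image, se):
    """Grayscale erosion: min over the SE support of image[x + s] - se[s]."""
    kf, kt = se.shape  # odd sizes assumed so the SE has a center
    pf, pt = kf // 2, kt // 2
    # Pad with +inf so out-of-range samples never win the minimum.
    padded = np.pad(image, ((pf, pf), (pt, pt)),
                    mode="constant", constant_values=np.inf)
    out = np.full(image.shape, np.inf)
    for i in range(kf):
        for j in range(kt):
            window = padded[i:i + image.shape[0], j:j + image.shape[1]]
            out = np.minimum(out, window - se[i, j])
    return out

def grey_dilate(image, se):
    """Grayscale dilation: max over the SE support of image[x - s] + se[s]."""
    kf, kt = se.shape
    pf, pt = kf // 2, kt // 2
    # Pad with -inf so out-of-range samples never win the maximum.
    padded = np.pad(image, ((pf, pf), (pt, pt)),
                    mode="constant", constant_values=-np.inf)
    out = np.full(image.shape, -np.inf)
    for i in range(kf):
        for j in range(kt):
            window = padded[i:i + image.shape[0], j:j + image.shape[1]]
            # The SE is reflected so that dilation pairs with the erosion above.
            out = np.maximum(out, window + se[kf - 1 - i, kt - 1 - j])
    return out

def grey_open(image, se):
    """Opening = erosion followed by dilation with the same SE."""
    return grey_dilate(grey_erode(image, se), se)

# Toy "cochleogram": rows are frequency channels, columns are time frames.
spec = np.zeros((5, 7))
spec[2, 3] = 10.0  # an isolated spectro-temporal peak

# Hypothetical non-flat SE: zero at the center, decaying toward the edges.
se = np.array([[-2.0, -1.0, -2.0],
               [-1.0,  0.0, -1.0],
               [-2.0, -1.0, -2.0]])

opened = grey_open(spec, se)
print(opened)
```

With a non-flat SE the isolated peak is attenuated to the height the SE allows above its surroundings rather than being removed outright, which is the kind of level-dependent behavior a masking-shaped SE is meant to provide. For real use, scipy.ndimage offers grey_erosion, grey_dilation and grey_opening with a structure argument, though its boundary handling differs from the padding used here.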