• DocumentCode
    1161256
  • Title

    Recent innovations in speech-to-text transcription at SRI-ICSI-UW

  • Author

    Stolcke, Andreas ; Chen, Barry ; Franco, Horacio ; Gadde, Venkata Ramana Rao ; Graciarena, Martin ; Hwang, Mei-Yuh ; Kirchhoff, Katrin ; Mandal, Arindam ; Morgan, Nelson ; Lei, Xin ; Ng, Tim ; Ostendorf, Mari ; Sönmez, Kemal ; Venkataraman, Anand ; Vergyr

  • Author_Institution
    SRI Int., Menlo Park, CA
  • Volume
    14
  • Issue
    5
  • fYear
    2006
  • Firstpage
    1729
  • Lastpage
    1744
  • Abstract
    We summarize recent progress in automatic speech-to-text transcription at SRI, ICSI, and the University of Washington. The work encompasses all components of speech modeling found in a state-of-the-art recognition system, from acoustic features, to acoustic modeling and adaptation, to language modeling. In the front end, we experimented with nonstandard features, including various measures of voicing, discriminative phone posterior features estimated by multilayer perceptrons, and a novel phone-level macro-averaging for cepstral normalization. Acoustic modeling was improved with combinations of front ends operating at multiple frame rates, as well as by modifications to the standard methods for discriminative Gaussian estimation. We show that acoustic adaptation can be improved by predicting the optimal regression class complexity for a given speaker. Language modeling innovations include the use of a syntax-motivated almost-parsing language model, as well as principled vocabulary-selection techniques. Finally, we address portability issues, such as the use of imperfect training transcripts, and language-specific adjustments required for recognition of Arabic and Mandarin.
  • Keywords
    Gaussian processes; cepstral analysis; grammars; multilayer perceptrons; natural languages; speech recognition; speech synthesis; text analysis; Arabic recognition; ICSI; Mandarin recognition; SRI; UW; University of Washington; acoustic adaptation; acoustic features; cepstral normalization; discriminative Gaussian estimation; discriminative phone posterior features; language modeling; language-specific adjustments; multilayer perceptron; multiple frame rates; optimal regression class complexity; phone-level macroaveraging; speech-to-text transcription; state-of-the-art recognition system; syntax-motivated almost-parsing language; vocabulary-selection techniques; Acoustic measurements; Adaptation model; Cepstral analysis; Computer science; Laboratories; Multilayer perceptrons; Natural languages; Speech recognition; Technological innovation; Telephony; Broadcast news (BN); conversational telephone speech (CTS); speech-to-text (STT)
  • fLanguage
    English
  • Journal_Title
    Audio, Speech, and Language Processing, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1558-7916
  • Type

    jour

  • DOI
    10.1109/TASL.2006.879807
  • Filename
    1677992