• DocumentCode
    730659
  • Title
    Grapheme-to-phoneme conversion using Long Short-Term Memory recurrent neural networks
  • Author
    Rao, Kanishka; Peng, Fuchun; Sak, Haşim; Beaufays, Françoise
  • Author_Institution
    Google Inc., Mountain View, CA, USA
  • fYear
    2015
  • fDate
    19-24 April 2015
  • Firstpage
    4225
  • Lastpage
    4229
  • Abstract
    Grapheme-to-phoneme (G2P) models are key components in speech recognition and text-to-speech systems, as they describe how words are pronounced. We propose a G2P model based on a Long Short-Term Memory (LSTM) recurrent neural network (RNN). In contrast to traditional joint-sequence-based G2P approaches, LSTMs have the flexibility to take the full context of graphemes into account, transforming the problem from a series of grapheme-to-phoneme conversions into a word-to-pronunciation conversion. Training joint-sequence-based G2P models requires explicit grapheme-to-phoneme alignments, which are not straightforward to obtain since graphemes and phonemes do not correspond one-to-one; the LSTM-based approach forgoes the need for such explicit alignments. We experiment with a unidirectional LSTM (ULSTM) with different kinds of output delays and a deep bidirectional LSTM (DBLSTM) with a connectionist temporal classification (CTC) layer. The DBLSTM-CTC model achieves a word error rate (WER) of 25.8% on the public CMU dataset for US English. Combining the DBLSTM-CTC model with a joint n-gram model results in a WER of 21.3%, a 9% relative improvement over the previous best WER of 23.4% from a hybrid system.
    (See the illustrative code sketch after this record.)
  • Keywords
    neural nets; speech recognition; speech synthesis; synchronisation; CTC layer; DBLSTM-CTC model; G2P models; RNN; ULSTM; US English; WER; connectionist temporal classification; deep bidirectional LSTM; grapheme-to-phoneme alignments; grapheme-to-phoneme conversion; grapheme-to-phoneme models; hybrid system; joint n-gram model; joint-sequence based G2P; long short-term memory recurrent neural networks; public CMU dataset; text-to-speech systems; unidirectional LSTM; word error rate; word-to-pronunciation conversion; Google; Indexes; Joints; CTC; G2P; LSTM; pronunciation
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • Conference_Location
    South Brisbane, QLD, Australia
  • Type
    conf
  • DOI
    10.1109/ICASSP.2015.7178767
  • Filename
    7178767
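
A minimal, illustrative sketch of the DBLSTM-CTC setup described in the abstract, assuming PyTorch (the paper does not name a toolkit); the grapheme/phoneme vocabulary sizes, layer dimensions, and toy batch below are placeholder assumptions rather than the paper's configuration:

    import torch
    import torch.nn as nn

    class G2PBLSTMCTC(nn.Module):
        # Deep bidirectional LSTM with a CTC output layer:
        # grapheme id sequence in, per-position phoneme (+blank) logits out.
        def __init__(self, num_graphemes, num_phonemes, embed_dim=64,
                     hidden_dim=128, num_layers=2):
            super().__init__()
            self.embed = nn.Embedding(num_graphemes, embed_dim)
            self.blstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                                 bidirectional=True, batch_first=True)
            # One extra output unit for the CTC blank symbol (index 0 here).
            self.proj = nn.Linear(2 * hidden_dim, num_phonemes + 1)

        def forward(self, graphemes):
            x = self.embed(graphemes)      # (batch, word_len, embed_dim)
            x, _ = self.blstm(x)           # (batch, word_len, 2 * hidden_dim)
            return self.proj(x)            # (batch, word_len, num_phonemes + 1)

    # Toy training step; nn.CTCLoss expects log-probs shaped (time, batch, classes).
    model = G2PBLSTMCTC(num_graphemes=30, num_phonemes=40)
    ctc_loss = nn.CTCLoss(blank=0)
    graphemes = torch.randint(1, 30, (8, 16))        # 8 words, 16 grapheme ids each
    phonemes = torch.randint(1, 41, (8, 7))          # 8 target pronunciations, 7 phonemes each
    input_lens = torch.full((8,), 16, dtype=torch.long)
    target_lens = torch.full((8,), 7, dtype=torch.long)
    log_probs = model(graphemes).log_softmax(-1).transpose(0, 1)
    loss = ctc_loss(log_probs, phonemes, input_lens, target_lens)
    loss.backward()

At inference, a simple greedy CTC decode (argmax per position, collapse repeats, drop blanks) yields the phoneme sequence; per the abstract, the best reported result combines the DBLSTM-CTC model with a joint n-gram model.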