DocumentCode :
730659
Title :
Grapheme-to-phoneme conversion using Long Short-Term Memory recurrent neural networks
Author :
Rao, Kanishka ; Peng, Fuchun ; Sak, Hasim ; Beaufays, Francoise
Author_Institution :
Google Inc., Mountain View, CA, USA
fYear :
2015
fDate :
19-24 April 2015
Firstpage :
4225
Lastpage :
4229
Abstract :
Grapheme-to-phoneme (G2P) models are key components in speech recognition and text-to-speech systems as they describe how words are pronounced. We propose a G2P model based on a Long Short-Term Memory (LSTM) recurrent neural network (RNN). In contrast to traditional joint-sequence based G2P approaches, LSTMs have the flexibility of taking into consideration the full context of graphemes and transform the problem from a series of grapheme-to-phoneme conversions to a word-to-pronunciation conversion. Training joint-sequence based G2P models requires explicit grapheme-to-phoneme alignments, which are not straightforward since graphemes and phonemes don't correspond one-to-one. The LSTM based approach forgoes the need for such explicit alignments. We experiment with unidirectional LSTM (ULSTM) with different kinds of output delays and deep bidirectional LSTM (DBLSTM) with a connectionist temporal classification (CTC) layer. The DBLSTM-CTC model achieves a word error rate (WER) of 25.8% on the public CMU dataset for US English. Combining the DBLSTM-CTC model with a joint n-gram model results in a WER of 21.3%, which is a 9% relative improvement compared to the previous best WER of 23.4% from a hybrid system.
Keywords :
neural nets; speech recognition; speech synthesis; synchronisation; CTC layer; DBLSTM-CTC model; G2P models; RNN; ULSTM; US English; WER; connectionist temporal classification; deep bidirectional LSTM; grapheme-to-phoneme alignments; grapheme-to-phoneme conversion; grapheme-to-phoneme models; hybrid system; joint n-gram model; joint-sequence based G2P; long short-term memory recurrent neural networks; public CMU dataset; speech recognition; text-to-speech systems; unidirectional LSTM; word error rate; word-to-pronunciation conversion; Google; Indexes; Joints; CTC; G2P; LSTM; RNN; pronunciation; speech recognition;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on
Conference_Location :
South Brisbane, QLD
Type :
conf
DOI :
10.1109/ICASSP.2015.7178767
Filename :
7178767