Regional Information Center for Science and Technology (RICEST) - Grapheme-to-phoneme conversion using Long Short-Term Memory recurrent neural networks

DocumentCode :

730659

Title :

Grapheme-to-phoneme conversion using Long Short-Term Memory recurrent neural networks

Author :

Rao, Kanishka ; Fuchun Peng ; Sak, Hasim ; Beaufays, Francoise

Author_Institution :

Google Inc., Mountain View, CA, USA

fYear :

2015

fDate :

19-24 April 2015

Firstpage :

4225

Lastpage :

4229

Abstract :

Grapheme-to-phoneme (G2P) models are key components in speech recognition and text-to-speech systems as they describe how words are pronounced. We propose a G2P model based on a Long Short-Term Memory (LSTM) recurrent neural network (RNN). In contrast to traditional joint-sequence based G2P approaches, LSTMs have the flexibility of taking into consideration the full context of graphemes and transform the problem from a series of grapheme-to-phoneme conversions to a word-to-pronunciation conversion. Training joint-sequence based G2P models requires explicit grapheme-to-phoneme alignments, which are not straightforward since graphemes and phonemes don't correspond one-to-one. The LSTM based approach forgoes the need for such explicit alignments. We experiment with unidirectional LSTM (ULSTM) with different kinds of output delays and deep bidirectional LSTM (DBLSTM) with a connectionist temporal classification (CTC) layer. The DBLSTM-CTC model achieves a word error rate (WER) of 25.8% on the public CMU dataset for US English. Combining the DBLSTM-CTC model with a joint n-gram model results in a WER of 21.3%, which is a 9% relative improvement compared to the previous best WER of 23.4% from a hybrid system.
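
Note on the figures above: the 9% relative improvement follows from the two WER values quoted in the abstract, (23.4 - 21.3) / 23.4. The Python sketch below reproduces that arithmetic and shows one common way a word-level error rate can be computed for G2P output. It is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, the toy pronunciations are made up for the example, and it assumes a single reference pronunciation per word with exact-match scoring.

```python
def word_error_rate(predicted, reference):
    """Percentage of words whose predicted phoneme sequence differs from the reference.

    Assumption (not stated in the record): one reference pronunciation per word,
    and a word counts as an error unless the sequences match exactly.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("reference must not be empty")
    errors = sum(1 for p, r in zip(predicted, reference) if p != r)
    return 100.0 * errors / len(reference)


def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction, in percent, of new_wer over baseline_wer."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# WER values quoted in the abstract: previous best 23.4%, combined model 21.3%.
print(round(relative_improvement(23.4, 21.3), 1))  # 9.0

# Toy data (hypothetical, not from the paper): one of four words is wrong.
reference = [("K", "AE", "T"), ("D", "AO", "G"), ("S", "IY"), ("G", "OW")]
predicted = [("K", "AE", "T"), ("D", "AO", "G"), ("S", "IY"), ("G", "UW")]
print(word_error_rate(predicted, reference))  # 25.0
```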

Keywords :

neural nets; speech recognition; speech synthesis; synchronisation; CTC layer; DBLSTM-CTC model; G2P models; RNN; ULSTM; US English; WER; connectionist temporal classification; deep bidirectional LSTM; grapheme-to-phoneme alignments; grapheme-to-phoneme conversion; grapheme-to-phoneme models; hybrid system; joint n-gram model; joint-sequence based G2P; long short-term memory recurrent neural networks; public CMU dataset; text-to-speech systems; unidirectional LSTM; word error rate; word-to-pronunciation conversion; Google; Indexes; Joints; CTC; G2P; LSTM; pronunciation

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on

Conference_Location :

South Brisbane, QLD

Type :

conf

DOI :

10.1109/ICASSP.2015.7178767

Filename :

7178767

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=730659