Regional Information Center for Science and Technology

DocumentCode :

312238

Title :

A Mandarin text-to-speech system

Author :

Hwang, Shaw-Hwa ; Chen, Sin-Horng ; Wang, Yih-Ru

Author_Institution :

Dept. of Commun. Eng., Nat. Chiao Tung Univ., Hsinchu, Taiwan

Volume :

fYear :

1996

fDate :

3-6 Oct 1996

Firstpage :

1421

Abstract :

The implementation of a high-performance Mandarin text-to-speech (TTS) system is presented. The system is composed of four main parts: text analysis (TA), prosodic information generation (PIG), a waveform table (WT) of 411 base-syllables, and PSOLA-based waveform synthesis (PSOLA). In TA, a statistical-model-based method is first employed to automatically tag the input text, yielding the word sequence and the associated part-of-speech (POS) sequence. A lexicon of about 80,000 words is used in the tagging process. The corresponding base-syllable sequence is then determined and used to retrieve the basic waveform sequence from WT. Some linguistic features used in PIG are also extracted in TA. In PIG, a four-layer recurrent neural network (RNN) is employed to generate prosodic information, including the pitch contour, energy level, initial and final durations of each syllable, and inter-syllable pause duration. Finally, in PSOLA, the basic waveform sequence is modified using this prosodic information to generate the output synthetic speech. The whole system is implemented in software on a PC/AT 486 with a 16-bit Sound Blaster add-on card and requires only 3.2 Mbytes of memory. It can synthesize speech in real time for any input Chinese text. Informal listening tests by many native Chinese speakers living in Taiwan confirmed that the synthetic speech sounded very fluent and natural.
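The abstract names PSOLA as the final stage but gives no details of how the stored base-syllable waveforms are modified. The following is a minimal, pure-Python sketch of generic time-domain PSOLA (TD-PSOLA) pitch modification, assuming the analysis pitch marks of a voiced waveform are already known and strictly increasing. The function names `hann` and `td_psola_pitch_shift`, the nearest-mark selection rule, and the two-period Hann window are illustrative assumptions, not the authors' implementation; duration and energy modification are not shown.

```python
import math


def hann(n):
    """Hann window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def td_psola_pitch_shift(x, marks, beta):
    """Hypothetical TD-PSOLA sketch: scale F0 by `beta` while keeping duration.

    x     : list of samples (one voiced segment)
    marks : strictly increasing analysis pitch marks (sample indices), len >= 2
    beta  : pitch factor (> 1 raises F0, < 1 lowers it)
    """
    y = [0.0] * len(x)
    t_s = float(marks[0])  # first synthesis pitch mark
    while t_s <= marks[-1]:
        # Pick the analysis mark closest to the current synthesis mark.
        i = min(range(len(marks)), key=lambda k: abs(marks[k] - t_s))
        # Local pitch period around that analysis mark.
        if i + 1 < len(marks):
            period = marks[i + 1] - marks[i]
        else:
            period = marks[i] - marks[i - 1]
        win = hann(2 * period + 1)  # two-period window centred on the mark
        center = int(round(t_s))
        # Overlap-add the windowed analysis segment at the synthesis mark.
        for k in range(-period, period + 1):
            src = marks[i] + k
            dst = center + k
            if 0 <= src < len(x) and 0 <= dst < len(y):
                y[dst] += x[src] * win[k + period]
        # Synthesis marks are spaced period / beta apart: closer means higher F0.
        t_s += period / beta
    return y


# Usage: a unit pulse train with a 100-sample period, pitch raised one octave.
x = [0.0] * 400
marks = [50, 150, 250, 350]
for m in marks:
    x[m] = 1.0
y = td_psola_pitch_shift(x, marks, beta=2.0)
print([n for n, v in enumerate(y) if v > 0.5])
```

With `beta=2.0` the pulses in the output are 50 samples apart instead of 100 (F0 doubled), while the output keeps the input length. A per-syllable pitch contour, such as the one produced by PIG, would in practice make `beta` vary from mark to mark rather than stay constant.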

Keywords :

computational linguistics; feedforward neural nets; multilayer perceptrons; recurrent neural nets; sequences; speech synthesis; statistical analysis; 16 bit; 3.2 Mbyte; Mandarin text-to-speech system; PC/AT 486; PSOLA-based waveform synthesis; Sound Blaster add-on card; automatic input text tagging; base syllables; energy level; final syllable duration; four-layer recurrent neural network; initial syllable duration; inter-syllable pause duration; lexicon; linguistic features; output synthetic speech; part-of-speech sequence; pitch contour; prosodic information generation; software; statistical model; text analysis; waveform table; word sequence; Acoustic testing; Contracts; Councils; Data mining; Energy states; Humans; Natural languages; Speech synthesis; Tagging; Text analysis;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on

Conference_Location :

Philadelphia, PA

Print_ISBN :

0-7803-3555-4

Type :

conf

DOI :

10.1109/ICSLP.1996.607881

Filename :

607881

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=312238