Title :
Romanian language statistics and resources for text-to-speech systems
Author :
Stan, Adriana ; Giurgiu, Mircea
Author_Institution :
Commun. Dept., Tech. Univ. of Cluj-Napoca, Cluj-Napoca, Romania
Abstract :
This paper presents a series of results and experiments from the development of a Romanian text-to-speech system, focusing on text statistics. We investigate the occurrence of several linguistic units used in text-to-speech systems, from phonemes to words. The text corpus we used, News-Romanian (News-RO), comprises 4500 newspaper articles. A subset of it, around 2500 sentences, represents the Romanian Speech Synthesis (RSS) recorded speech database. The results offer important insight into how a speech database should be designed. We also describe the methods used to develop a 50,000-word Romanian lexicon with phonetic transcription and accent positioning. Such a lexicon is useful for the machine learning algorithms in the front-end of a text-to-speech system. In addition, we study the use of the Maximal Onset Principle for Romanian syllabification.
Keywords :
audio databases; natural language processing; speech synthesis; statistics; News-Romanian; Romanian Speech Synthesis recorded speech database; Romanian language statistics; Romanian lexicon; Romanian syllabification; Romanian text-to-speech system; accent positioning; machine learning algorithms; maximal onset principle; newspaper articles; phonemes; phonetic transcription; sentences; text statistics; words; Databases; Europe; High temperature superconductors; Speech; Speech synthesis; Text processing; Training; Romanian; lexicon; speech synthesis; text-to-speech;
Conference_Title :
Electronics and Telecommunications (ISETC), 2010 9th International Symposium on
Conference_Location :
Timisoara
Print_ISBN :
978-1-4244-8457-7
DOI :
10.1109/ISETC.2010.5679318