DocumentCode :
705214
Title :
Statistical digram and trigram analysis of Turkish in terms of coverage and entropy for possible language and speech based applications
Author :
Uslu, Ibrahim Baran ; Yilmaz, Asim Egemen ; Ilk, Hakki Gokhan
Author_Institution :
Electr. & Electron. Eng. Dept., Baskent Univ., Ankara, Turkey
fYear :
2010
fDate :
23-27 Aug. 2010
Firstpage :
776
Lastpage :
780
Abstract :
In this study, two frameworks, made up of digrams and trigrams, are built for complete coverage of the Turkish language. In addition, character, digram and trigram entropy values for Turkish, English and Spanish are compared. Examining meaningful Turkish texts, we find that 3 major digram clusters constitute slightly more than 60% of Turkish texts. Similarly, 3 major trigram clusters cover almost 40% of Turkish texts. The statistics show that, for 99% coverage of Turkish, 391 (of 841 theoretical) digrams and 3,396 (of 24,389 theoretical) trigrams are sufficient. The results of this study offer a general roadmap for rapid coverage to researchers working on Turkish language and speech based applications. As an application, the results could lead to a general framework for setting the rules of prioritization in duration modeling in concatenative text-to-speech synthesis systems.
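The coverage and entropy figures in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the character-level overlapping n-gram definition, and the toy sample string are all assumptions made for illustration. The theoretical counts of 841 and 24,389 are consistent with a 29-symbol alphabet (29 squared and 29 cubed), which matches the number of letters in the Turkish alphabet.

```python
import math
from collections import Counter

def ngram_counts(text, n):
    """Count overlapping character n-grams in text (assumed definition)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def entropy_bits(counts):
    """Shannon entropy, in bits per n-gram, of the empirical distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def ngrams_for_coverage(counts, target):
    """Smallest number of most-frequent n-grams whose share reaches target."""
    total = sum(counts.values())
    covered = 0
    for k, (_, c) in enumerate(counts.most_common(), start=1):
        covered += c
        if covered / total >= target:
            return k
    return len(counts)

# Theoretical n-gram inventory for a 29-letter alphabet.
ALPHABET_SIZE = 29
theoretical_digrams = ALPHABET_SIZE ** 2   # 841
theoretical_trigrams = ALPHABET_SIZE ** 3  # 24389

# Toy usage on a hypothetical sample; a real study would use a large corpus.
sample = "bir dil bir insan iki dil iki insan"
digrams = ngram_counts(sample, 2)
print(entropy_bits(digrams))
print(ngrams_for_coverage(digrams, 0.99))
```

On a real corpus, the same `ngrams_for_coverage` call with a target of 0.99 would yield the kind of count the abstract reports (391 digrams, 3,396 trigrams), although the exact values depend on the corpus and on how spaces and punctuation are treated.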
Keywords :
entropy; natural language processing; speech processing; speech synthesis; text analysis; English; Spanish; Turkish language based applications; Turkish texts; concatenative text-to-speech synthesis systems; coverage; digram distributions; digram entropy values; duration modeling; speech based applications; statistical digram analysis; statistical trigram analysis; trigram entropy values; Electronic publishing; Europe; Phasor measurement units; Signal processing;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Signal Processing Conference, 2010 18th European
Conference_Location :
Aalborg
ISSN :
2219-5491
Type :
conf
Filename :
7096487