Title :
Statistical digram and trigram analysis of Turkish in terms of coverage and entropy for possible language and speech based applications
Author :
Uslu, Ibrahim Baran ; Yilmaz, Asim Egemen ; Ilk, Hakki Gokhan
Author_Institution :
Electr. & Electron. Eng. Dept., Baskent Univ., Ankara, Turkey
Abstract :
In this study, two frameworks, built from digrams and trigrams, are constructed for complete coverage of the Turkish language. In addition, character, digram and trigram entropy values for Turkish, English and Spanish are compared. An examination of meaningful Turkish texts shows that there are 3 major digram clusters, which together constitute slightly more than 60% of Turkish texts. Similarly, there are 3 major trigram clusters, which cover almost 40% of Turkish texts. The statistics show that 391 (of 841 theoretical) digrams and 3,396 (of 24,389 theoretical) trigrams are sufficient for 99% coverage of Turkish. The results of this study provide a general roadmap for rapid coverage to researchers working on Turkish language and speech based applications. As an application, the results could lead to a general framework for setting the prioritization rules for duration modeling in concatenative text-to-speech synthesis systems.
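The two quantities the abstract reports, n-gram entropy and the number of n-grams needed for a given coverage, can be illustrated with a minimal sketch. It is not the authors' procedure: the function names are hypothetical, n-grams are assumed to be counted inside words only (not across spaces), and the input is assumed to be already lowercased. The theoretical totals quoted in the abstract match a 29-letter alphabet (29^2 = 841 digrams, 29^3 = 24,389 trigrams).

```python
import math
from collections import Counter


def ngram_counts(text, n):
    """Count overlapping character n-grams inside each whitespace-separated word.

    Assumption: n-grams do not span word boundaries and non-letters are dropped.
    """
    counts = Counter()
    for word in text.split():
        letters = "".join(ch for ch in word if ch.isalpha())
        for i in range(len(letters) - n + 1):
            counts[letters[i:i + n]] += 1
    return counts


def entropy_bits(counts):
    """Shannon entropy (in bits) of the empirical n-gram distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def ngrams_needed(counts, coverage):
    """Smallest number of most frequent n-grams whose counts reach `coverage` of the total."""
    total = sum(counts.values())
    covered = 0
    for rank, (_, c) in enumerate(counts.most_common(), start=1):
        covered += c
        if covered >= coverage * total:
            return rank
    return len(counts)


if __name__ == "__main__":
    sample = "bir dil bir insan iki dil iki insan"  # toy text, not the paper's corpus
    digrams = ngram_counts(sample, 2)
    print(len(digrams), entropy_bits(digrams), ngrams_needed(digrams, 0.99))
```

On a real corpus, calling `ngrams_needed(counts, 0.99)` for n = 2 and n = 3 would give figures comparable in kind to the 391 and 3,396 reported above, though the exact values depend on the corpus and on the counting assumptions.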
Keywords :
entropy; natural language processing; speech processing; speech synthesis; text analysis; English; Spanish; Turkish language based applications; Turkish texts; concatenative text-to-speech synthesis systems; coverage; digram distributions; digram entropy values; duration modeling; speech based applications; statistical digram analysis; statistical trigram analysis; trigram entropy values; Electronic publishing; Europe; Phasor measurement units; Signal processing;
Conference_Title :
Signal Processing Conference, 2010 18th European
Conference_Location :
Aalborg