Regional Information Center for Science and Technology - Unsupervised clustering of syllables for language identification

DocumentCode :

1845915

Title :

Unsupervised clustering of syllables for language identification

Author :

Dey, Subhadeep ; Murthy, Hema

Author_Institution :

Dept. of Comput. Sci. & Eng., IIT Madras, Chennai, India

fYear :

2012

fDate :

27-31 Aug. 2012

Firstpage :

325

Lastpage :

329

Abstract :

Automatic language recognition makes extensive use of phonotactics to identify a language. The accuracy of phonotactic information depends on the amount of data available for training. State-of-the-art approaches capture phonotactics in terms of cross-lingual GMM tokens, and the accuracy of such tokenisers depends crucially on the availability of specific corpora. In this paper, we suggest an alternative to GMM tokens, namely syllable-based tokens. Syllables implicitly capture the phonotactics across phonemes in a language. Unsupervised syllable tokenisation for language identification requires (a) segmentation of speech into syllable-like units, and (b) unsupervised modeling of the syllable tokens by hidden Markov models. The first issue is addressed by segmenting the waveform into syllable-like units using a well-established group-delay-based segmentation algorithm. To address the second issue, two solutions are proposed: (i) a top-down clustering approach, which is robust and does not require significant parameter tuning, and (ii) a universal syllable approach, in which syllable models for every language are obtained from adapted universal syllable models. Experimental results on the OGI 1992 multilingual corpus and the NIST 2003 LRE corpus show that the proposed approaches do not require significant tuning of parameters and that their performance is comparable to that of a well-tuned baseline syllable tokenisation system.

Keywords :

hidden Markov models; natural language processing; pattern clustering; speech recognition; unsupervised learning; NIST 2003 LRE corpus; OGI 1992 multilingual corpus; automatic language recognition; cross-lingual GMM tokens; group delay based segmentation algorithm; hidden Markov models; language identification; phonotactic information; syllable based tokens; top down clustering approach; universal syllable approach; unsupervised clustering; unsupervised syllable tokenisation; Adaptation models; Clustering algorithms; Databases; Hidden Markov models; NIST; Speech; Training; syllable segmentation; top down syllable clustering; universal syllable models; unsupervised clustering;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO)

Conference_Location :

Bucharest

ISSN :

2219-5491

Print_ISBN :

978-1-4673-1068-0

Type :

conf

Filename :

6333800

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1845915