DocumentCode
672867
Title
Dice's coefficient on trigram profiles as metric for language similarity
Author
Oco, Nathaniel ; Romeritch Syliongka, Leif ; Roxas, Rachel Edita ; Ilao, Joel
Author_Institution
Coll. of Comput. Studies, De La Salle Univ., Manila, Philippines
fYear
2013
fDate
25-27 Nov. 2013
Firstpage
1
Lastpage
4
Abstract
In this study, we present Dice's coefficient on trigram profiles as a metric for language similarity. As a testbed, we focused on eight Philippine languages. No known language similarity value exists for these languages. Documents containing transcribed audio recordings, news articles, and religious and literary texts were taken from an online corpus and used as training data. Character trigram profiles were then generated using an n-gram generator, and language similarity was computed. The results were matched against those reported in the literature and against the language family tree. To evaluate the metric, it was applied to five languages with known similarity values, and the results were compared with an existing lexical similarity metric. The average difference is 27%. Analyses of the results reveal that phonetic spelling plays an important role in language similarity. As future work, the metric can be used on phonetic transcriptions.
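The abstract does not say how the trigram profiles are built or compared in detail, so the following is only a minimal sketch of the general technique. It assumes a profile is the set of distinct character trigrams in a text (the paper may instead use frequency-ranked or truncated profiles) and uses the standard set form of Dice's coefficient, 2|A ∩ B| / (|A| + |B|). The function names `trigram_profile` and `dice_coefficient` are hypothetical and do not come from the paper.

```python
def trigram_profile(text: str) -> set[str]:
    """Return the set of distinct character trigrams in `text`.

    Assumption: lowercase the text and collapse runs of whitespace to a
    single space; the paper's actual preprocessing is not stated.
    """
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def dice_coefficient(a: set[str], b: set[str]) -> float:
    """Dice's coefficient between two sets: 2|A & B| / (|A| + |B|)."""
    if not a and not b:
        # Assumption: treat two empty profiles as having no similarity.
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def language_similarity(text_a: str, text_b: str) -> float:
    """Similarity of two texts via Dice's coefficient on trigram sets."""
    return dice_coefficient(trigram_profile(text_a), trigram_profile(text_b))
```

Under these assumptions, calling `language_similarity(corpus_a, corpus_b)` on two language samples yields a value between 0 (no shared trigrams) and 1 (identical trigram sets).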
Keywords
audio recording; natural language processing; speech processing; Dice coefficient; Philippine languages; audio recordings; language similarity; literary texts; n-gram generator; news articles; phonetic transcriptions; religious texts; trigram profiles; Audio recording; Data models; Educational institutions; Measurement; Pragmatics; Presses; Training data; Dice's coefficient; closely-related languages
fLanguage
English
Publisher
IEEE
Conference_Titel
Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013 International Conference
Conference_Location
Gurgaon
Type
conf
DOI
10.1109/ICSDA.2013.6709892
Filename
6709892