Detecting Romanized Thai tokens in social media texts

Author

Moknarong, Nutthamon ; Suchato, Atiwong ; Punyabukkana, Proadpran

Author_Institution

Dept. of Comput. Eng., Chulalongkorn Univ., Bangkok, Thailand

fYear

2013

fDate

4-6 Sept. 2013

Firstpage

58

Lastpage

63

Abstract

Social media content is created by a large number of users, each with their own writing style shaped by their creativity and attitudes. In online social networks of Thai users, typed texts sometimes include Thai words transliterated into Roman letters, and text-to-speech systems cannot pronounce these transliterated tokens correctly. In this work, we propose and evaluate statistical methods for detecting Romanized Thai tokens. Both context-dependent and context-free classification features are proposed. Real social network texts are used to construct the training set and the test set. Human subjects detect Romanized Thai tokens with 91.16% accuracy on average when adjacent contexts are hidden, and with 99.41% accuracy when contexts are shown. With the proposed features, a decision tree-based classifier and an N-gram-based classifier yield 87.63% and 74.42% accuracy, respectively. In the latter case, the accuracy increases to 82.60% when the tokens' existence in English dictionaries is considered. Combining the two methods results in a detection accuracy of 89.36%.
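
The abstract does not state the N-gram order, the smoothing scheme, or how the dictionary check is applied, so the following is only a minimal sketch of the general idea under assumptions: character bigrams, add-one smoothing, and an English dictionary lookup that overrides the N-gram decision. The training tokens, the names `char_ngrams`, `NgramModel`, and `classify`, and the label strings are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def char_ngrams(token, n=2):
    """Return the character n-grams of a token, padded with boundary markers."""
    padded = "^" + token.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramModel:
    """Add-one smoothed frequency model over character n-grams (an assumption)."""

    def __init__(self, tokens, n=2):
        self.n = n
        self.counts = Counter(g for t in tokens for g in char_ngrams(t, n))
        self.total = sum(self.counts.values())

    def log_prob(self, token, vocab_size):
        # Sum of smoothed log-probabilities of the token's n-grams.
        return sum(
            math.log((self.counts[g] + 1) / (self.total + vocab_size))
            for g in char_ngrams(token, self.n)
        )


def classify(token, thai_model, english_model, english_dictionary=frozenset()):
    """Label a token, letting a dictionary hit override the n-gram score."""
    if token.lower() in english_dictionary:
        return "english"
    # Shared vocabulary size, plus one slot for unseen n-grams.
    vocab_size = len(set(thai_model.counts) | set(english_model.counts)) + 1
    thai_score = thai_model.log_prob(token, vocab_size)
    english_score = english_model.log_prob(token, vocab_size)
    return "romanized_thai" if thai_score > english_score else "english"


# Tiny hypothetical training lists, for illustration only.
THAI_MODEL = NgramModel(["sawasdee", "khrap", "khob", "khun", "aroi", "mak"])
ENGLISH_MODEL = NgramModel(["hello", "thank", "you", "good", "food", "the"])

if __name__ == "__main__":
    for word in ["khun", "hello"]:
        print(word, classify(word, THAI_MODEL, ENGLISH_MODEL))
```

Passing a set of English words as `english_dictionary` mimics the dictionary feature mentioned above: a token found in the set is labeled English regardless of its N-gram score. This sketch is context-free; the context-dependent features and the decision tree described in the abstract are not modeled here.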

Keywords

context-free languages; decision trees; dictionaries; feature extraction; natural language processing; pattern classification; social networking (online); statistical analysis; text analysis; English dictionaries; N-gram-based classifier; Roman letters; Romanized Thai token detection; Thai words; context-dependent classification feature; context-free classification feature; creative thinking; decision tree-based classifier; online social networks; social media contents; social media texts; statistical methods; test set; training set; typed texts; Accuracy; Decision trees; Dictionaries; Feature extraction; Media; Social network services; Training; Language identification; Social network; Statistical approach;

fLanguage

English

Publisher

IEEE

Conference_Titel

Computer Science and Engineering Conference (ICSEC), 2013 International

Conference_Location

Nakorn Pathom

Print_ISBN

978-1-4673-5322-9

Type

conf

DOI

10.1109/ICSEC.2013.6694753

Filename

6694753