Title :
Detecting Romanized Thai tokens in social media texts
Author :
Moknarong, Nutthamon ; Suchato, Atiwong ; Punyabukkana, Proadpran
Author_Institution :
Dept. of Comput. Eng., Chulalongkorn Univ., Bangkok, Thailand
Abstract :
Social media content is created by a large number of users, each with a writing style that reflects their own creativity and attitudes. In online social networks of Thai users, typed text sometimes includes Thai words transliterated into Roman letters, and text-to-speech systems cannot pronounce these transliterated tokens correctly. In this work, we propose and evaluate statistical methods for detecting Romanized Thai tokens. Both context-dependent and context-free classification features are proposed. Real social network texts are used to construct the training set and the test set. Human subjects detect Romanized Thai tokens with 91.16% accuracy on average when adjacent contexts are hidden, and with 99.41% accuracy when contexts are shown. With the proposed features, a decision tree-based classifier and an N-gram-based classifier yield 87.63% and 74.42% accuracy, respectively. In the latter case, the accuracy increases to 82.60% when the tokens' existence in English dictionaries is considered. Combining the two methods results in a detection accuracy of 89.36%.
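The idea of pairing an N-gram-based classifier with an English dictionary check can be illustrated with a minimal sketch. The abstract does not state the N-gram order, the smoothing scheme, or how the dictionary lookup is combined with the N-gram score, so everything below is an assumption: character bigrams, add-one smoothing, and a dictionary hit that simply overrides the N-gram decision. The word lists are tiny hypothetical samples, not the authors' data.

```python
import math
from collections import Counter

# Hypothetical toy data; a real system would train on social network text.
ROMANIZED_THAI = ["sawasdee", "khrap", "khob", "khun", "aroi", "mai",
                  "pai", "nai", "sabai", "krub", "jing", "mak"]
ENGLISH = ["hello", "thanks", "the", "good", "where", "going",
           "today", "really", "very", "much", "please", "with"]
ENGLISH_DICT = set(ENGLISH)  # stand-in for a full English dictionary

# Possible next symbols: 26 lowercase letters plus the end marker "$".
NEXT_SYMBOLS = 27


def train_char_bigrams(words):
    """Count character bigrams, with ^ and $ as word boundary markers."""
    bigrams, contexts = Counter(), Counter()
    for word in words:
        padded = "^" + word.lower() + "$"
        for a, b in zip(padded, padded[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts


def log_prob(token, model):
    """Add-one smoothed bigram log-probability of a token under a model."""
    bigrams, contexts = model
    padded = "^" + token.lower() + "$"
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + NEXT_SYMBOLS))
        for a, b in zip(padded, padded[1:])
    )


def is_romanized_thai(token, thai_model, eng_model, english_dict):
    """Assumed combination rule: a dictionary hit means 'English';
    otherwise pick the language whose bigram model scores higher."""
    if token.lower() in english_dict:
        return False
    return log_prob(token, thai_model) > log_prob(token, eng_model)


THAI_MODEL = train_char_bigrams(ROMANIZED_THAI)
ENG_MODEL = train_char_bigrams(ENGLISH)

if __name__ == "__main__":
    for tok in ["khrap", "the"]:
        print(tok, is_romanized_thai(tok, THAI_MODEL, ENG_MODEL, ENGLISH_DICT))
```

This sketch is context-free: it scores each token in isolation. The context-dependent features mentioned in the abstract would additionally use neighboring tokens, which is not shown here.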
Keywords :
context-free languages; decision trees; dictionaries; feature extraction; natural language processing; pattern classification; social networking (online); statistical analysis; text analysis; English dictionaries; N-gram-based classifier; Roman letters; Romanized Thai token detection; Thai words; context-dependent classification feature; context-free classification feature; creative thinking; decision tree-based classifier; online social networks; social media contents; social media texts; statistical methods; test set; training set; typed texts; Accuracy; Decision trees; Dictionaries; Feature extraction; Media; Social network services; Training; Language identification; Social network; Statistical approach;
Conference_Title :
Computer Science and Engineering Conference (ICSEC), 2013 International
Conference_Location :
Nakorn Pathom
Print_ISBN :
978-1-4673-5322-9
DOI :
10.1109/ICSEC.2013.6694753