• DocumentCode
    661882
  • Title
    Detecting Romanized Thai tokens in social media texts
  • Author
    Moknarong, Nutthamon ; Suchato, Atiwong ; Punyabukkana, Proadpran
  • Author_Institution
    Dept. of Comput. Eng., Chulalongkorn Univ., Bangkok, Thailand
  • fYear
    2013
  • fDate
    4-6 Sept. 2013
  • Firstpage
    58
  • Lastpage
    63
  • Abstract
    Social media content is created by a large number of users, each with their own writing style, which depends on their creative thinking or attitudes. In online social networks of Thai users, typed texts sometimes include Thai words transliterated with Roman letters, and text-to-speech systems cannot pronounce these transliterated tokens correctly. In this work, we propose and evaluate statistical methods for detecting Romanized Thai tokens. Both context-dependent and context-free classification features are proposed. Real social network texts are used to construct the training set and the test set. Human subjects detect Romanized Thai tokens with 91.16% accuracy on average when adjacent contexts are hidden, and with 99.41% accuracy when contexts are shown. With the proposed features, a decision tree-based classifier and an N-gram-based classifier yield 87.63% and 74.42% accuracy, respectively. In the latter case, the accuracy increases to 82.60% when the tokens' existence in English dictionaries is considered. Combining the two methods results in a detection accuracy of 89.36%.
  • Keywords
    context-free languages; decision trees; dictionaries; feature extraction; natural language processing; pattern classification; social networking (online); statistical analysis; text analysis; English dictionaries; N-gram-based classifier; Roman letters; Romanized Thai token detection; Thai words; context-dependent classification feature; context-free classification feature; creative thinking; decision tree-based classifier; online social networks; social media contents; social media texts; statistical methods; test set; training set; typed texts; Accuracy; Decision trees; Dictionaries; Feature extraction; Media; Social network services; Training; Language identification; Social network; Statistical approach;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Science and Engineering Conference (ICSEC), 2013 International
  • Conference_Location
    Nakorn Pathom
  • Print_ISBN
    978-1-4673-5322-9
  • Type
    conf
  • DOI
    10.1109/ICSEC.2013.6694753
  • Filename
    6694753