• DocumentCode
    3490225
  • Title
    A Pointwise Approach for Vietnamese Diacritics Restoration
  • Author
    Luu, T.A. ; Yamamoto, Koji
  • Author_Institution
    Dept. of Electr. Eng., Nagaoka Univ. of Technol., Nagaoka, Japan
  • fYear
    2012
  • fDate
    13-15 Nov. 2012
  • Firstpage
    189
  • Lastpage
    192
  • Abstract
    The automatic insertion of diacritics in electronic texts is necessary for a number of languages, including French, Romanian, Croatian, Sindhi, and Vietnamese. When diacritics are removed from a word and the resulting string of characters is not a word, it is easy to recover the diacritics. However, sometimes the resulting string is also a word, possibly with different grammatical properties or a different meaning, and this makes recovery of the missing diacritics a difficult task for software as well as for human readers. This paper is the first to study automatic diacritic restoration in Vietnamese texts. Modern Vietnamese is a complex language with many diacritical marks, and white space does not always function as a word separator. This paper proposes a pointwise approach for automatically recovering missing diacritics, using three features for classification: n-grams of syllables, n-grams of syllable types, and dictionary word features. Our experiments show that the proposed method recovers diacritics with 94.7% accuracy.
  • Keywords
    electronic publishing; natural language processing; pattern classification; Croatian; French; Romanian; Sindhi; Vietnamese diacritics restoration; automatic diacritic restoration; automatic insertion; dictionary word features; electronic texts; feature classification; grammatical properties; pointwise approach; word separator; Accuracy; Dictionaries; Hidden Markov models; Labeling; Support vector machine classification; Training; Training data; Vietnamese; classification
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Asian Language Processing (IALP), 2012 International Conference on
  • Conference_Location
    Hanoi
  • Print_ISBN
    978-1-4673-6113-2
  • Electronic_ISBN
    978-0-7695-4886-9
  • Type
    conf
  • DOI
    10.1109/IALP.2012.18
  • Filename
    6473728
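
The abstract names three feature families for pointwise classification: n-grams of syllables, n-grams of syllable types, and dictionary word features. The Python sketch below is only an illustration of how such features might be extracted around one position of a diacritic-free sentence; it is not the authors' implementation. The helper names (`strip_diacritics`, `syllable_type`, `pointwise_features`), the coarse type inventory, the window size, the feature string layout, and the tiny lexicon are all assumptions, since the abstract does not specify them.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove diacritics, e.g. to build undiacritized input from a clean corpus."""
    decomposed = unicodedata.normalize("NFD", text)
    no_marks = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # The letters đ/Đ have no combining-mark decomposition, so map them explicitly.
    return no_marks.replace("đ", "d").replace("Đ", "D")


def syllable_type(syl: str) -> str:
    """Hypothetical coarse syllable typing; the abstract does not define the types."""
    if syl == "<PAD>":
        return "PAD"
    if syl.isdigit():
        return "NUM"
    if syl.isalpha():
        return "UPPER" if syl[0].isupper() else "LOWER"
    return "OTHER"


def pointwise_features(syllables, i, window=2, max_n=2, lexicon=frozenset()):
    """Collect features for position i using only the surrounding input syllables.

    No previously predicted labels are used, which is what makes the
    classification pointwise rather than sequential.
    """
    def at(j):
        return syllables[j] if 0 <= j < len(syllables) else "<PAD>"

    feats = []
    for n in range(1, max_n + 1):
        # Every n-gram that fits inside [i - window, i + window].
        for start in range(i - window, i + window - n + 2):
            gram = [at(j) for j in range(start, start + n)]
            offset = start - i
            feats.append(f"S{n}[{offset}]=" + "_".join(gram))
            feats.append(f"T{n}[{offset}]=" + "_".join(syllable_type(g) for g in gram))
            # Dictionary feature: an n-gram covering position i is a lexicon entry.
            covers_i = start <= i < start + n
            if covers_i and " ".join(gram) in lexicon:
                feats.append(f"D{n}[{offset}]")
    return feats


if __name__ == "__main__":
    sentence = strip_diacritics("tôi là sinh viên").split()
    toy_lexicon = frozenset({"sinh vien"})  # hypothetical diacritic-free lexicon
    for feat in pointwise_features(sentence, 2, window=1, lexicon=toy_lexicon):
        print(feat)
```

In a full system, each feature list would be passed to a classifier that picks the diacritized form for the syllable at position `i`. Because white space does not always separate words in Vietnamese, the dictionary features here look up multi-syllable n-grams rather than single tokens.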