Title :
A Pointwise Approach for Vietnamese Diacritics Restoration
Author :
Luu, T.A. ; Yamamoto, Koji
Author_Institution :
Dept. of Electr. Eng., Nagaoka Univ. of Technol., Nagaoka, Japan
Abstract :
Automatic insertion of diacritics in electronic texts is necessary for a number of languages, including French, Romanian, Croatian, Sindhi, and Vietnamese. When diacritics are removed from a word and the resulting string of characters is not a word, the diacritics are easy to recover. Sometimes, however, the resulting string is also a word, possibly with different grammatical properties or a different meaning, which makes recovering the missing diacritics difficult for software as well as for human readers. This paper is the first to study automatic diacritic restoration in Vietnamese texts. Modern Vietnamese is a complex language with many diacritical marks, and white space does not always function as a word separator. This paper proposes a pointwise approach for automatically recovering missing diacritics, using three types of features for classification: n-grams of syllables, n-grams of syllable types, and dictionary word features. Our experiments show that the proposed method recovers diacritics with 94.7% accuracy.
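The three feature families named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the window size, the syllable-type inventory (NUM, CAP, LOW, OTHER), and the feature string format are all assumptions made for illustration, and the abstract does not specify any of them. The sketch strips diacritics from text and then, for one syllable position at a time, collects syllable n-grams, syllable-type n-grams, and dictionary-match features that a classifier could consume.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese diacritics, e.g. to build training input."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (tone marks, circumflex, breve, horn).
    bare = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # 'đ'/'Đ' have no canonical decomposition, so map them explicitly.
    return bare.replace("đ", "d").replace("Đ", "D")


def syllable_type(syllable: str) -> str:
    """Coarse syllable type (hypothetical inventory, assumed here)."""
    if syllable.isdigit():
        return "NUM"
    if syllable.isalpha():
        return "CAP" if syllable[0].isupper() else "LOW"
    return "OTHER"


def pointwise_features(syllables, i, dictionary, window=2):
    """Features for position i only; each position is classified independently.

    `dictionary` is assumed to hold lowercased, diacritic-free forms of
    multi-syllable dictionary words, with syllables joined by single spaces.
    """
    feats = []
    padded = ["<s>"] * window + list(syllables) + ["</s>"] * window
    c = i + window  # index of the target syllable inside `padded`
    for n in (1, 2):
        for start in range(c - window, c + window - n + 2):
            gram = padded[start:start + n]
            offset = start - c
            # Feature family 1: n-grams of syllables.
            feats.append(f"S{n}[{offset}]=" + "_".join(gram))
            # Feature family 2: n-grams of syllable types.
            feats.append(f"T{n}[{offset}]=" + "_".join(syllable_type(s) for s in gram))
    # Feature family 3: dictionary words that cover position i.
    for length in (2, 3):
        first = max(0, i - length + 1)
        last = min(i, len(syllables) - length)
        for start in range(first, last + 1):
            candidate = " ".join(syllables[start:start + length]).lower()
            if candidate in dictionary:
                feats.append(f"D{length}[{start - i}]")
    return feats


# Usage: a toy dictionary and one diacritic-free sentence.
dictionary = {strip_diacritics(w).lower() for w in ["sinh viên", "đại học"]}
syllables = strip_diacritics("tôi là sinh viên").split()
features = pointwise_features(syllables, 2, dictionary)
```

In a pointwise setup of this kind, the label predicted for each position would be the diacritized form of that syllable, and no prediction depends on the labels chosen for neighbouring positions.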
Keywords :
electronic publishing; natural language processing; pattern classification; Croatian; French; Romanian; Sindhi; Vietnamese diacritics restoration; automatic diacritic restoration; automatic insertion; dictionary word features; electronic texts; feature classification; grammatical properties; pointwise approach; word separator; Accuracy; Dictionaries; Hidden Markov models; Labeling; Support vector machine classification; Training; Training data; Vietnamese; classification
Conference_Title :
Asian Language Processing (IALP), 2012 International Conference on
Conference_Location :
Hanoi
Print_ISBN :
978-1-4673-6113-2
Electronic_ISBN :
978-0-7695-4886-9
DOI :
10.1109/IALP.2012.18