DocumentCode
3490225
Title
A Pointwise Approach for Vietnamese Diacritics Restoration
Author
Luu, T.A. ; Yamamoto, Koji
Author_Institution
Dept. of Electr. Eng., Nagaoka Univ. of Technol., Nagaoka, Japan
fYear
2012
fDate
13-15 Nov. 2012
Firstpage
189
Lastpage
192
Abstract
Automatic insertion of diacritics in electronic texts is necessary for a number of languages, including French, Romanian, Croatian, Sindhi, and Vietnamese. When removing the diacritics from a word yields a string that is not itself a word, the diacritics are easy to recover. Sometimes, however, the resulting string is also a valid word, possibly with different grammatical properties or a different meaning, which makes recovering the missing diacritics difficult for software as well as for human readers. This paper is the first to study automatic diacritic restoration in Vietnamese texts. Modern Vietnamese is a complex language with many diacritical marks, and white space does not always function as a word separator. This paper proposes a pointwise approach for automatically recovering missing diacritics, using three kinds of features for classification: n-grams of syllables, n-grams of syllable types, and dictionary word features. Our experiments show that the proposed method recovers diacritics with an accuracy of 94.7%.
Keywords
electronic publishing; natural language processing; pattern classification; Croatian; French; Romanian; Sindhi; Vietnamese diacritics restoration; automatic diacritic restoration; automatic insertion; dictionary word features; electronic texts; feature classification; grammatical properties; pointwise approach; word separator; Accuracy; Dictionaries; Hidden Markov models; Labeling; Support vector machine classification; Training; Training data; Vietnamese; classification
fLanguage
English
Publisher
IEEE
Conference_Titel
Asian Language Processing (IALP), 2012 International Conference on
Conference_Location
Hanoi
Print_ISBN
978-1-4673-6113-2
Electronic_ISBN
978-0-7695-4886-9
Type
conf
DOI
10.1109/IALP.2012.18
Filename
6473728