• DocumentCode
    677402
  • Title
    A hybrid method for word segmentation with English-Vietnamese bilingual text
  • Author
    Quoc Hung Ngo; Dinh Dien; Werner Winiwarter
  • Author_Institution
    Comput. Sci. Fac., Univ. of Inf. Technol., Ho Chi Minh City, Vietnam
  • fYear
    2013
  • fDate
    25-28 Nov. 2013
  • Firstpage
    48
  • Lastpage
    52
  • Abstract
    This paper proposes a hybrid approach to Vietnamese word segmentation. The approach combines a dictionary-based method and a machine learning method to detect word boundaries in Vietnamese text by comparing English-Vietnamese pairs. We also point out several characteristics of Vietnamese that affect both Vietnamese word segmentation and the word alignment of English-Vietnamese text. In addition, we built EVBCorpus, an English-Vietnamese bilingual corpus of nearly 10 million words, and manually segmented part of EVBNews at the word level. We evaluate our approach by comparing its word segmentation results on this corpus. The hybrid approach achieves 97% accuracy on the EVBNews corpus.
  • Keywords
    natural language processing; text analysis; EVBCorpus; EVBNews; English-Vietnamese bilingual corpus; English-Vietnamese bilingual text; English-Vietnamese pair comparison; English-Vietnamese text word alignment; Vietnamese word segmentation; dictionary-based method; hybrid method; machine learning method; word boundary detection; Accuracy; Conferences; Dictionaries; Hidden Markov models; Software; Support vector machines; Training
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Control, Automation and Information Sciences (ICCAIS), 2013 International Conference on
  • Conference_Location
    Nha Trang
  • Print_ISBN
    978-1-4799-0569-0
  • Type
    conf
  • DOI
    10.1109/ICCAIS.2013.6720528
  • Filename
    6720528