• DocumentCode
    3713063
  • Title

    Automatic word segmentation for spoken Cantonese

  • Author

    Roxana Fung;Brigitte Bigi

  • Author_Institution
    The Hong Kong Polytechnic University, Department of Chinese and Bilingual Studies, Hung Hom, Hong Kong
  • fYear
    2015
  • Firstpage
    196
  • Lastpage
    201
  • Abstract
    Though Cantonese is the most influential variety of Chinese other than Mandarin, only a limited number of Cantonese corpora are available for linguistic studies. Among the essential steps of building a corpus, word segmentation is necessary but highly challenging because Cantonese lacks clear word boundaries. This paper reports the construction and evaluation of an open-source automatic word segmenter for Cantonese. The tool is a component of the multilingual SPPAS program designed to be used directly by linguists. It is free software distributed under a GPL license. The effectiveness of the tool was evaluated by segmenting samples of a spoken Cantonese corpus both manually and automatically with the tool, then comparing the results. High precision and recall were found in our study. Upon completion, the tool is expected to promote the development of more Cantonese corpora for language-related studies.
  • Keywords
    "Speech","Dictionaries","Pragmatics","Electronic mail","Buildings","Open source software"
  • Publisher
    IEEE
  • Conference_Titel
    Oriental COCOSDA held jointly with 2015 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2015 International Conference
  • Type

    conf

  • DOI
    10.1109/ICSDA.2015.7357891
  • Filename
    7357891
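
  • Note
    The abstract describes evaluating the segmenter by comparing manual and automatic segmentations and reporting precision and recall. The record does not say how these scores were computed, so the following is only a minimal sketch of one common word-level scheme: a word counts as correct when its character span matches a span in the manual segmentation. The function names and the example sentence are illustrative assumptions, not taken from the paper or from SPPAS.

```python
def word_spans(words):
    """Map a list of words to the set of (start, end) character spans they occupy."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def precision_recall(manual, automatic):
    """Score an automatic segmentation against a manual (reference) one.

    Assumption: both segmentations cover exactly the same character string,
    so a word is correct when its span also appears in the reference.
    """
    if "".join(manual) != "".join(automatic):
        raise ValueError("segmentations must cover the same text")
    gold = word_spans(manual)
    predicted = word_spans(automatic)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example sentence (not from the evaluated corpus).
manual = ["我", "鍾意", "食", "點心"]
automatic = ["我", "鍾意", "食", "點", "心"]
precision, recall = precision_recall(manual, automatic)
print(precision, recall)  # 0.6 0.75
```

    In this example three of the five automatically produced words match the four reference words, giving a precision of 3/5 and a recall of 3/4.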