DocumentCode
3713063
Title
Automatic word segmentation for spoken Cantonese
Author
Roxana Fung;Brigitte Bigi
Author_Institution
The Hong Kong Polytechnic University, Department of Chinese and Bilingual Studies, Hung Hom, Hong Kong
fYear
2015
Firstpage
196
Lastpage
201
Abstract
Though Cantonese is the most influential variety of Chinese other than Mandarin, only a limited number of Cantonese corpora are available for linguistic studies. Among the essential steps of building a corpus, word segmentation is necessary but highly challenging because Cantonese lacks clear word boundaries. This paper reports the construction and evaluation of an open-source automatic word segmenter for Cantonese. The tool is a component of the multilingual SPPAS program designed to be used directly by linguists. It is free software distributed under a GPL license. The effectiveness of the tool was evaluated by comparing manual segmentation of samples from a spoken Cantonese corpus with the automatic segmentation produced by the tool. High precision and recall were found in our study. Upon completion, the tool is expected to promote the development of more Cantonese corpora for language-related studies.
Keywords
"Speech","Dictionaries","Pragmatics","Electronic mail","Buildings","Open source software"
Publisher
ieee
Conference_Titel
Oriental COCOSDA held jointly with 2015 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2015 International Conference
Type
conf
DOI
10.1109/ICSDA.2015.7357891
Filename
7357891