DocumentCode :
531974
Title :
Methodology and operation for extraction of Chinese segmentation units
Author :
Rong, Liu ; Zhiping, Zhang ; Ning, Pang
Author_Institution :
Foreign Language Coll., Taiyuan Univ. of Technol., Taiyuan, China
Volume :
5
fYear :
2010
fDate :
22-24 Oct. 2010
Abstract :
Chinese is written without spaces or other word delimiters, so Chinese word segmentation (CWS) is the first step in Chinese language processing. Generally, words and fixed phrases (idioms, named entities) can be tagged successfully. Besides words and fixed phrases, however, segmentation units should also be tagged. The segmentation specification written by Liu Yuan defines segmentation units as phrases that are “integrated closely and used steadily”. Until now, there has been no explicit criterion for judging whether a phrase is integrated closely and used steadily, and no operable method for extracting segmentation units. This paper first puts forward a principle for evaluating whether a phrase meets the standard of close integration and steady usage, and then presents a detailed method for extracting segmentation units. We use a hybrid method for the extraction task, which combines frequency, mutual information, and entropy calculations with linguistic information from both syntactic rules and semantic explanation. We extract segmentation units within a multi-feature framework and evaluate the results manually. Experiments show that the approach is effective. Our work can provide a useful list of segmentation units for NLP applications.
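The abstract names three statistical features (frequency, mutual information, entropy) without giving formulas or thresholds. The following is a minimal sketch of how such features are commonly computed for candidate phrases, not the authors' implementation. It assumes two-character candidates, pointwise mutual information in bits, and left/right branching entropy over neighbouring characters; the names `entropy` and `score_bigrams` are hypothetical, and the linguistic filtering by syntactic rules and semantic explanation is not shown.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (bits) of the distribution given by raw counts."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def score_bigrams(sentences):
    """Score every character bigram by frequency, PMI and branching entropy.

    Assumption: candidates are adjacent character pairs; the record does not
    say how the paper generates or bounds its candidates.
    """
    unigrams = Counter()
    bigrams = Counter()
    left = defaultdict(Counter)   # characters seen immediately before a bigram
    right = defaultdict(Counter)  # characters seen immediately after a bigram

    for s in sentences:
        unigrams.update(s)
        for i in range(len(s) - 1):
            bg = s[i:i + 2]
            bigrams[bg] += 1
            if i > 0:
                left[bg][s[i - 1]] += 1
            if i + 2 < len(s):
                right[bg][s[i + 2]] += 1

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for bg, freq in bigrams.items():
        p_xy = freq / n_bi
        p_x = unigrams[bg[0]] / n_uni
        p_y = unigrams[bg[1]] / n_uni
        scores[bg] = {
            "freq": freq,
            # High PMI suggests the two characters are "integrated closely".
            "pmi": math.log2(p_xy / (p_x * p_y)),
            # High entropy on both sides suggests the pair is "used steadily"
            # across varied contexts rather than being part of a longer unit.
            "left_entropy": entropy(left[bg]),
            "right_entropy": entropy(right[bg]),
        }
    return scores
```

Calling `score_bigrams(["我吃饭了", "他吃饭了", "你吃饭吧"])` returns a dictionary keyed by bigram. A candidate list could then be built by thresholding the three features, with the cut-off values chosen empirically since the record does not state them.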
Keywords :
natural language processing; text analysis; CWS; Chinese language processing; Chinese segmentation unit extraction; Chinese word segmentation; NLP; entropy; frequency calculation; linguistic information; mutual information; syntactic rules; Semantics; World Wide Web; Chinese Segmentation Units; extraction; integrated closely and used steadily;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Application and System Modeling (ICCASM), 2010 International Conference on
Conference_Location :
Taiyuan
Print_ISBN :
978-1-4244-7235-2
Electronic_ISBN :
978-1-4244-7237-6
Type :
conf
DOI :
10.1109/ICCASM.2010.5619235
Filename :
5619235