Title :
Enhancement of unsupervised feature selection for conditional random fields learning in Chinese word segmentation
Author :
Jiang, Mike Tian-Jian ; Hsu, Wen-Lian ; Kuo, Chan-Hung ; Yang, Ting-Hao
Author_Institution :
Dept. of Comput. Sci., Nat. Tsing Hua Univ., Hsinchu, Taiwan
Abstract :
This work proposes a unified view of several unsupervised feature selection methods based on frequent strings that improve conditional random field (CRF) models for Chinese word segmentation (CWS). The features include character-based n-grams (CNG), accessor variety based strings (AVS), term-contributed frequency (TCF), and term-contributed boundary (TCB), with a specific manner of boundary overlapping. In the experiments, the baseline is the 6-tag scheme, a state-of-the-art labeling scheme for CRF-based CWS, and the data sets are taken from SIGHAN CWS bakeoff 2005 and SIGHAN CWS 2010. The experimental results show that all of these features improve the performance of the baseline system in terms of recall, precision, and their harmonic mean (the F1 measure), for both overall accuracy (F) and out-of-vocabulary recognition (FOOV). In particular, this work presents a novel feature selection approach, the compound feature “AVS+TCB”, which outperforms the other feature types for CRF-based CWS in terms of F and FOOV.
Keywords :
feature extraction; learning (artificial intelligence); natural language processing; text analysis; Chinese word segmentation; F1 measure score; SIGHAN CWS; accessor variety based string; boundary overlapping; character-based n-gram; conditional random field learning; frequent strings; out-of-vocabulary recognition; term-contributed boundary; term-contributed frequency; unsupervised feature selection enhancement; Accuracy; Arrays; Entropy; Feature extraction; Labeling; Rails; Training; Conditional random fields; accessor variety; term-contributed boundary; term-contributed frequency; unsupervised feature selection; word segmentation;
Conference_Title :
Natural Language Processing and Knowledge Engineering (NLP-KE), 2011 7th International Conference on
Conference_Location :
Tokushima
Print_ISBN :
978-1-61284-729-0
DOI :
10.1109/NLPKE.2011.6138229