Title :
A Chinese word segmentation algorithm based on maximum entropy
Author :
Zhang, Li-Yan ; Qin, Min ; Zhang, Xue-Mei ; Ma, Hong-Xia
Author_Institution :
Inst. of Inf., Hebei Univ. of Sci. & Technol., Shijiazhuang, China
Abstract :
Automatic word segmentation is an important component of modern Chinese information processing and a key technology for Chinese full-text retrieval. This paper presents a Chinese word segmentation algorithm based on maximum entropy. It uses the part-of-speech tags and word-frequency tags of a corpus to build a maximum entropy model based on mutual information, which serves as the language model for word segmentation. Finally, the binary model is used to test whether expanding the training corpus affects word segmentation accuracy, and curves relating the expansion of the training corpus to word segmentation accuracy are obtained.
Keywords :
entropy; information retrieval; text analysis; word processing; Chinese full-text retrieval; Chinese information processing; Chinese word segmentation algorithm; binary model; maximum entropy model; part-of-speech tagging; word frequency tagging; word segmentation language model; Accuracy; Computational modeling; Context; Entropy; Mathematical model; Probability; Training; Chinese full text retrieval; Maximum entropy; Word segmentation algorithm;
Conference_Title :
Machine Learning and Cybernetics (ICMLC), 2010 International Conference on
Conference_Location :
Qingdao
Print_ISBN :
978-1-4244-6526-2
DOI :
10.1109/ICMLC.2010.5580902