Title :
An automatic Chinese collocation extraction algorithm based on lexical statistics
Author :
Xu, Ruifeng ; Lu, Qin ; Li, Yin
Author_Institution :
Dept. of Comput., Hong Kong Polytech. Univ., China
Abstract :
We present an automatic Chinese collocation extraction system that uses lexical statistics and syntactic knowledge. The system extracts collocations from a manually segmented and tagged Chinese news corpus in three stages. First, bidirectional bigram statistical measures, including bidirectional strength and spread, together with the χ² test value, are used to extract candidate two-word pairs. These candidate word pairs are then used to extract high-frequency multiword collocations from their context. In the third stage, precision is further improved by using syntactic knowledge of collocation patterns between content words to eliminate pseudo collocations. In a preliminary experiment on 30 selected headwords, this three-stage system achieves a precision of 73%, a substantial improvement on the 61% achieved by an algorithm we developed earlier, which was based on an improved version of Smadja's 53%-accurate Xtract system.
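The abstract names the χ² test as one of the first-stage measures but does not give its formulation. As a rough illustration only, the sketch below computes the standard Pearson χ² statistic for a 2x2 contingency table of a word pair. The function name `chi_square`, its parameters, and the choice of plain co-occurrence counts are assumptions made for this example, not the authors' implementation.

```python
def chi_square(pair_count, w1_count, w2_count, total):
    """Pearson chi-square for a 2x2 table of a candidate word pair.

    pair_count: number of observations containing both w1 and w2
    w1_count:   number of observations containing w1
    w2_count:   number of observations containing w2
    total:      total number of observations
    (All names are hypothetical; the paper's counting scheme is not
    specified in the abstract.)
    """
    o11 = pair_count                          # w1 and w2 together
    o12 = w1_count - pair_count               # w1 without w2
    o21 = w2_count - pair_count               # w2 without w1
    o22 = total - w1_count - w2_count + pair_count  # neither word
    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denom == 0:
        # A row or column of the table is empty: no evidence either way.
        return 0.0
    return total * (o11 * o22 - o12 * o21) ** 2 / denom


# Independent words score 0; words that always co-occur score high.
independent = chi_square(10, 30, 40, 120)
associated = chi_square(5, 5, 5, 10)
```

A candidate pair would be kept when its score exceeds a chosen threshold. The threshold, and how this score is combined with bidirectional strength and spread, are not stated in the abstract.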
Keywords :
computational linguistics; natural languages; automatic Chinese collocation extraction system; bidirectional bigram statistical measures; information extraction; lexical statistics; pseudo collocations; syntactical knowledge; Bidirectional control; Data mining; Frequency; Mutual information; Natural languages; Statistics; Sun; Testing;
Conference_Title :
Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 International Conference on
Conference_Location :
Beijing, China
Print_ISBN :
0-7803-7902-0
DOI :
10.1109/NLPKE.2003.1275923