Title :
Fast co-occurrence thesaurus construction for Chinese news
Author :
Tseng, Yuen-Hsien
Author_Institution :
Dept. of Libr. & Inf. Sci., Fu Jen Catholic Univ., Taipei, Taiwan
Abstract :
This paper reports an approach to automatic thesaurus construction for Chinese news articles. An effective Chinese word segmentation and keyword extraction algorithm is presented first. On average, 33% of the keywords identified in each document are unknown to a lexicon of 123,226 terms, and the extraction error rate is 3.6%. The keywords extracted from each document are then filtered further, and their associations are analyzed with a modified Dice coefficient formula. Association weights larger than a threshold are accumulated over all documents to yield the final term-pair similarities. Compared with previous studies, this method drastically speeds up thesaurus generation while achieving a similar level of term relatedness.
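The per-document association step and the accumulation over documents described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the modified Dice formula, so the standard Dice coefficient over sentence-level co-occurrence is substituted here; the input format (each document as a list of sentences, each sentence as a list of already-extracted keywords), the function names, and the default threshold of 0.5 are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def dice(sents_a, sents_b):
    """Standard Dice coefficient between two sets of sentence indices.

    Assumption: the paper uses a *modified* Dice formula that the abstract
    does not spell out; the plain form is used here as a stand-in.
    """
    if not sents_a or not sents_b:
        return 0.0
    return 2.0 * len(sents_a & sents_b) / (len(sents_a) + len(sents_b))


def accumulate_associations(documents, threshold=0.5):
    """Sum per-document association weights that exceed `threshold`.

    `documents` is a list of documents; each document is a list of
    sentences; each sentence is a list of keyword strings (hypothetical
    input format, assuming segmentation and extraction are already done).
    Returns a dict mapping an alphabetically ordered keyword pair to its
    accumulated weight.
    """
    totals = defaultdict(float)
    for sentences in documents:
        # Map each keyword to the set of sentence indices it occurs in.
        occurrences = defaultdict(set)
        for idx, keywords in enumerate(sentences):
            for kw in keywords:
                occurrences[kw].add(idx)
        # Pairs are formed only among keywords of the same document.
        for a, b in combinations(sorted(occurrences), 2):
            weight = dice(occurrences[a], occurrences[b])
            if weight > threshold:
                totals[(a, b)] += weight
    return dict(totals)


if __name__ == "__main__":
    docs = [
        [["thesaurus", "keyword"], ["thesaurus", "keyword"], ["news"]],
        [["thesaurus", "keyword"], ["thesaurus"]],
    ]
    for pair, weight in accumulate_associations(docs).items():
        print(pair, round(weight, 3))
```

In this sketch, pairs are compared only within a single document, so the pairwise work is bounded by each document's keyword count rather than by the vocabulary of the whole collection.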
Keywords :
information retrieval; thesauri; Chinese keyword extraction algorithm; Chinese news articles; Chinese word segmentation algorithm; association weights; automatic thesaurus construction; extraction error rate; fast co-occurrence thesaurus construction; final term pair similarities; modified Dice coefficient formula; term association analysis; term relatedness; Books; Data mining; Databases; Error analysis; Frequency; Information retrieval; Information science; Libraries; Statistics; Thesauri
Conference_Title :
Systems, Man, and Cybernetics, 2001 IEEE International Conference on
Conference_Location :
Tucson, AZ
Print_ISBN :
0-7803-7087-2
DOI :
10.1109/ICSMC.2001.973022