Title :
Research on Popular Words and Phrases Extraction of Network Base on PAT TREE
Author :
Wu, Baozhen ; He, Tingting ; Zhang, Yong ; Li, Li ; Chen, Long
Author_Institution :
Dept. of Comput. Sci., Hua Zhong Normal Univ., Wuhan
Abstract :
This paper aims to mine popular words and phrases from the Internet using a dedicated algorithm. We download web pages dated from January 1, 2007 to June 30, 2007 from different information sources on the network. Starting from full segmentation with a Pat-Tree, we filter the set of candidate words in three successive passes: first a weight filter based on the vector space model, then a filter based on a language regulation model, and finally a filter that removes rubbish clusters. We then extract the popular words and phrases from the remaining candidates using a popular-word determinant formula, and we also plot the tendency curves of the popular words. Experiments indicate that, without reducing the accuracy of the extracted catchwords, the speed of computer-aided extraction of popular network words and phrases improves markedly.
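The first stage of the pipeline described in the abstract (candidate generation followed by a vector-space weight filter) can be illustrated with a minimal sketch. Everything concrete here is an assumption rather than the authors' method: a plain substring counter stands in for the Pat-Tree, the weight is a generic smoothed TF-IDF score, and the names `candidate_counts`, `weight_filter`, `max_len`, and `min_weight` are hypothetical. The abstract does not give the paper's actual weighting or determinant formulas, so none are reproduced.

```python
from collections import Counter
import math


def candidate_counts(text, max_len=4):
    """Count every substring of length 2..max_len in one document.

    Assumption: this brute-force counter is a stand-in for the frequency
    lookups a Pat-Tree would provide after full segmentation.
    """
    counts = Counter()
    for i in range(len(text)):
        for n in range(2, max_len + 1):
            if i + n <= len(text):
                counts[text[i:i + n]] += 1
    return counts


def weight_filter(docs, max_len=4, min_weight=0.5):
    """Keep candidates whose best per-document weight reaches min_weight.

    Assumption: weight = tf * ln((1 + N) / df), a smoothed TF-IDF score
    chosen for illustration; the paper's own weight is not specified here.
    """
    per_doc = [candidate_counts(d, max_len) for d in docs]

    # Document frequency: in how many documents each candidate occurs.
    df = Counter()
    for counts in per_doc:
        df.update(counts.keys())

    n_docs = len(docs)
    kept = {}
    for counts in per_doc:
        for term, tf in counts.items():
            weight = tf * math.log((1 + n_docs) / df[term])
            if weight >= min_weight:
                kept[term] = max(kept.get(term, 0.0), weight)
    return kept
```

Calling `weight_filter(["abab", "xyxy"], max_len=2, min_weight=2.0)` keeps only the repeated candidates `"ab"` and `"xy"`. The later stages (language regulation model, rubbish-cluster filter, determinant formula) would operate on the returned dictionary, but they are left out because the abstract gives no detail on them.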
Keywords :
Internet; data mining; natural languages; trees (mathematics); word processing; Pat-Tree; language regulation; phrases extraction; vector space model; weight filter; Computer networks; Computer science; Helium; IP networks; Information filtering; Information filters; Software algorithms; Software engineering; Statistics; Chinese information processing; PAT TREE; Popular curves; Popular words and phrases of network
Conference_Title :
Computer Science and Software Engineering, 2008 International Conference on
Conference_Location :
Wuhan, Hubei
Print_ISBN :
978-0-7695-3336-0
DOI :
10.1109/CSSE.2008.1210