Title :
Text Clustering via Term Semantic Units
Author :
Jing, Liping ; Yun, Jiali ; Yu, Jian ; Huang, Houkuan
Author_Institution :
Dept. of Comput. Sci., Beijing Jiaotong Univ., Beijing, China
Date :
Aug. 31 2010-Sept. 3 2010
Abstract :
How best to represent text data is an important problem in text mining tasks, including information retrieval, clustering, and classification. In this paper, we propose a compact document representation based on term semantic units, which are identified from implicit and explicit semantic information. The implicit semantic information is extracted from syntactic content via statistical methods such as latent semantic indexing and the information bottleneck. The explicit semantic information is mined from an external semantic resource (Wikipedia). The proposed compact representation model maps a document collection into a low-dimensional space whose dimensions are the term semantic units, which are far fewer than the unique terms in the collection. Experimental results on real data sets show that the compact representation efficiently improves the performance of text clustering.
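The mapping from a term space to a smaller space of term semantic units can be illustrated with a minimal sketch. It assumes the units are already known: the hand-written TERM_TO_UNIT table and the helper to_unit_vector are hypothetical and stand in for units that the paper derives from latent semantic indexing, the information bottleneck, and Wikipedia. It shows only the dimensionality reduction step, not the authors' method for finding the units.

```python
from collections import Counter

# Hypothetical term-to-unit assignment (an assumption for illustration).
# In the paper, units come from statistical methods and Wikipedia,
# not from a hand-written table like this one.
TERM_TO_UNIT = {
    "car": "vehicle",
    "automobile": "vehicle",
    "truck": "vehicle",
    "apple": "fruit",
    "banana": "fruit",
}


def to_unit_vector(tokens, term_to_unit, units):
    """Map a tokenized document to counts over term semantic units."""
    # Terms with no assigned unit are ignored in this sketch.
    counts = Counter(term_to_unit[t] for t in tokens if t in term_to_unit)
    return [counts[u] for u in units]


# Fixed, sorted unit order so every document uses the same dimensions.
units = sorted(set(TERM_TO_UNIT.values()))

docs = [
    ["car", "truck", "automobile"],
    ["apple", "banana", "car"],
]

# Five unique terms collapse into two unit dimensions per document.
vectors = [to_unit_vector(d, TERM_TO_UNIT, units) for d in docs]
print(units)
print(vectors)
```

Each document becomes a vector with one entry per unit rather than one per unique term, and such vectors could then be passed to any standard clustering algorithm.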
Keywords :
pattern clustering; statistical analysis; text analysis; Wikipedia; compact document representation model; explicit semantic information; external semantic resource; information retrieval; semantic information extraction; statistical methods; term semantic units; text clustering; compact representation
Conference_Title :
Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on
Conference_Location :
Toronto, ON
Print_ISBN :
978-1-4244-8482-9
Electronic_ISBN :
978-0-7695-4191-4
DOI :
10.1109/WI-IAT.2010.23