DocumentCode :
3107460
Title :
Semantic Smoothing for Model-based Document Clustering
Author :
Zhang, Xiaodan ; Zhou, Xiaohua ; Hu, Xiaohua
Author_Institution :
Coll. of Inf. Sci. & Technol., Drexel Univ., Philadelphia, PA
fYear :
2006
fDate :
18-22 Dec. 2006
Firstpage :
1193
Lastpage :
1198
Abstract :
A document is often full of class-independent "general" words and short of class-specific "core" words, which makes document clustering difficult. We argue that both problems will be relieved by suitable smoothing of document models in agglomerative approaches and of cluster models in partitional approaches, thereby improving clustering quality. To the best of our knowledge, most model-based clustering approaches use Laplacian smoothing to prevent zero probabilities, while most similarity-based approaches employ the heuristic TF*IDF scheme to discount the effect of "general" words. Inspired by a series of statistical translation language models for text retrieval, we propose in this paper a novel smoothing method, referred to as context-sensitive semantic smoothing, for document clustering. The comparative experiment on three datasets shows that model-based clustering with semantic smoothing is effective in improving cluster quality.
Keywords :
document handling; information retrieval; pattern clustering; probability; smoothing methods; Laplacian smoothing; cluster quality; clustering quality; context-sensitive semantic smoothing; document models; model-based clustering; model-based document clustering; statistical translation language model; text retrieval; zero probability; Clustering algorithms; Context modeling; Educational institutions; Information retrieval; Information science; Laplace equations; Nearest neighbor searches; Probability; Smoothing methods; Vocabulary;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Mining, 2006. ICDM '06. Sixth International Conference on
Conference_Location :
Hong Kong
ISSN :
1550-4786
Print_ISBN :
0-7695-2701-7
Type :
conf
DOI :
10.1109/ICDM.2006.142
Filename :
4053178