DocumentCode :
2183579
Title :
A semi-supervised document clustering algorithm based on EM
Author :
Rigutini, Leonardo ; Maggini, Marco
Author_Institution :
Dipt. di Ingegneria dell'Informazione, Univ. di Siena, Italy
fYear :
2005
fDate :
19-22 Sept. 2005
Firstpage :
200
Lastpage :
206
Abstract :
Document clustering is a very hard task in automatic text processing since it requires extracting regular patterns from a document collection without a priori knowledge of the category structure. This task can also be difficult for humans because many different but valid partitions may exist for the same collection. Moreover, the lack of information about categories makes it difficult to apply effective feature selection techniques to reduce the noise in the representation of texts. Despite these intrinsic difficulties, text clustering is an important task for Web search applications in which huge collections or long query result lists must be automatically organized. Semi-supervised clustering lies between automatic categorization and auto-organization. It is assumed that the supervisor is not required to specify a set of classes, but only to provide a set of texts grouped according to the criteria to be used to organize the collection. In this paper, we present a novel algorithm for clustering text documents that exploits the EM algorithm together with a feature selection technique based on information gain. The experimental results show that only very few documents are needed to initialize the clusters and that the algorithm is able to properly extract the regularities hidden in a huge unlabeled collection.
Keywords :
feature extraction; pattern clustering; text analysis; EM algorithm; Web search; auto-organization clustering; automatic categorization; feature selection; semisupervised document clustering; text document clustering; Clustering algorithms; Data mining; Feedback; Humans; Noise measurement; Noise reduction; Ontologies; Text categorization; Text processing; Web search; EM; Information Gain; Semi-supervised Document Clustering;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Web Intelligence, 2005. Proceedings. The 2005 IEEE/WIC/ACM International Conference on
Print_ISBN :
0-7695-2415-X
Type :
conf
DOI :
10.1109/WI.2005.13
Filename :
1517843