Title :
Document clustering and cluster topic extraction in multilingual corpora
Author :
Silva, Joaquim ; Mexia, João ; Coelho, Agra ; Lopes, Gabriel
Author_Institution :
Univ. Nova de Lisboa, Lisbon, Portugal
Abstract :
A statistics-based approach for clustering documents and for extracting cluster topics is described. Relevant (meaningful) expressions (REs), automatically extracted from corpora, are used as clustering base features. These features are transformed and their number is strongly reduced in order to obtain a small set of document classification features. This is achieved on the basis of principal components analysis. Model-based clustering analysis finds the best number of clusters. Then, the most important REs are extracted from each cluster and taken as document cluster topics.
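The pipeline outlined in the abstract (feature reduction by principal components, then model-based selection of the number of clusters) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the document-by-RE matrix `X` is synthetic, the mixture uses spherical Gaussian components fitted by EM, the number of clusters is chosen by BIC, and the names `pca_reduce`, `fit_gmm` and `choose_clusters` are hypothetical. The authors' actual feature transformation and clustering model may differ.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def log_gauss(Z, means, variances):
    """Log density of every row of Z under every spherical Gaussian, shape (n, k)."""
    d = Z.shape[1]
    sq = ((Z[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return -0.5 * (d * np.log(2 * np.pi * variances)[None, :] + sq / variances[None, :])

def fit_gmm(Z, k, n_iter=100, var_floor=1e-3):
    """EM for a spherical Gaussian mixture; returns (labels, log-likelihood)."""
    n, d = Z.shape
    # Deterministic farthest-point initialisation of the component means.
    idx = [0]
    for _ in range(k - 1):
        dist = ((Z[:, None, :] - Z[idx][None, :, :]) ** 2).sum(axis=2).min(axis=1)
        idx.append(int(np.argmax(dist)))
    means = Z[idx].copy()
    variances = np.full(k, max(Z.var(), var_floor))
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability.
        log_p = log_gauss(Z, means, variances) + np.log(weights)[None, :]
        resp = np.exp(log_p - np.logaddexp.reduce(log_p, axis=1)[:, None])
        # M-step: update weights, means and per-component variances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / nk.sum()
        means = (resp.T @ Z) / nk[:, None]
        sq = ((Z[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        variances = np.maximum((resp * sq).sum(axis=0) / (d * nk), var_floor)
    log_p = log_gauss(Z, means, variances) + np.log(weights)[None, :]
    loglik = np.logaddexp.reduce(log_p, axis=1).sum()
    return log_p.argmax(axis=1), loglik

def choose_clusters(Z, candidates):
    """Fit one mixture per candidate k and keep the one with the lowest BIC."""
    best = None
    for k in candidates:
        labels, loglik = fit_gmm(Z, k)
        n_params = k * Z.shape[1] + k + (k - 1)  # means + variances + weights
        bic = -2.0 * loglik + n_params * np.log(len(Z))
        if best is None or bic < best[0]:
            best = (bic, k, labels)
    return best[1], best[2]

# Synthetic stand-in for a document-by-RE feature matrix (assumption):
# 90 "documents", 20 "RE features", generated around three group centres.
rng = np.random.default_rng(0)
centres = rng.normal(scale=3.0, size=(3, 20))
X = np.vstack([c + 0.3 * rng.normal(size=(30, 20)) for c in centres])

Z = pca_reduce(X, n_components=2)            # small set of classification features
best_k, labels = choose_clusters(Z, range(1, 6))
print("selected number of clusters:", best_k)
```

Selecting the number of components by an information criterion is one common way to let a model-based method choose the cluster count. The topic-extraction step (ranking REs within each cluster) is omitted here because the abstract does not state the importance measure used.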
Keywords :
data mining; document handling; pattern clustering; cluster topic extraction; document classification features; document clustering; model-based clustering analysis; multilingual corpora; principal components analysis; relevant expressions; statistics-based approach; Agriculture; Data mining; Dispersion; Feature extraction; Instruction sets; Organizing; Probability; Size measurement
Conference_Title :
Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on
Conference_Location :
San Jose, CA
Print_ISBN :
0-7695-1119-8
DOI :
10.1109/ICDM.2001.989559