DocumentCode :
3296967
Title :
An Improved Genetic Algorithm for Document Clustering with Semantic Similarity Measure
Author :
Song, Wei ; Park, Soon Cheol
Author_Institution :
Div. of Electron. & Inf. Eng., Chonbuk Nat. Univ., Jeonju
Volume :
1
fYear :
2008
fDate :
18-20 Oct. 2008
Firstpage :
536
Lastpage :
540
Abstract :
This paper proposes a self-organized genetic algorithm for document clustering based on a semantic similarity measure. Traditional text representations treat a document as a string of words and ignore conceptual similarity. We use a thesaurus-based ontology to overcome this problem. To investigate how ontology methods can be used effectively in document clustering, we implement a hybrid strategy that combines a thesaurus-based semantic similarity measure with the vector space model (VSM) measure to assess the similarity between documents more accurately. Considering the interplay between population diversity and selective pressure, we also put forward an approach based on dynamic evolution operators. In our experiments, two data sets of 200 and 600 documents are drawn from the Reuters-21578 corpus. The results show that the genetic algorithm combined with the hybrid semantic strategy (the thesaurus-based measure together with the VSM-based measure) outperforms the same algorithm with the VSM measure alone. Our clustering algorithm also improves precision and recall compared with k-means under the same similarity settings.
Keywords :
document handling; genetic algorithms; ontologies (artificial intelligence); pattern clustering; thesauri; Reuters-21578 corpus; document clustering; improved genetic algorithm; semantic similarity measure; thesaurus-based ontology; vector space model; Clustering algorithms; Clustering methods; Extraterrestrial measurements; Genetic algorithms; Genetic engineering; Ontologies; Partitioning algorithms; Testing; Vocabulary; Web sites; Wordnet; clustering; genetic algorithm; semantic similarity measure;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Natural Computation, 2008. ICNC '08. Fourth International Conference on
Conference_Location :
Jinan
Print_ISBN :
978-0-7695-3304-9
Type :
conf
DOI :
10.1109/ICNC.2008.374
Filename :
4666903