Title :
A Genetic Niching Algorithm with Self-Adaptating Operator Rates for Document Clustering
Author :
Leon, Errol ; Gomez, Jose ; Nasraoui, Olfa
Abstract :
We propose a genetic algorithm for document clustering in which an evolutionary multimodal optimization algorithm evolves candidate cluster representatives to search for dense regions in the sparse, high-dimensional vector space of text documents. Evolution acts not only on the document cluster representatives but also on the genetic operator rates, which are evolved simultaneously with them. The evolving population consists of candidate document cluster representatives encoded as sparse index and sparse index/frequency variable-length vectors. Specialized sparse genetic operators are defined for this representation. These operators achieve different degrees of exploitation and exploration in the search for optimal document cluster prototypes. In particular, the operator most specialized for the document clustering problem is the Sparse Top-K-Addition operator, which can be seen as an incentive towards more aggressive exploitation of the local context in a small subset of documents, whereas the simple Sparse Real Addition operator works in a more exploratory manner. As shown in our experiments on two well-known document data sets, taking into account associated terms within a local context adds the benefit of an explicit latent semantic consideration in the search for optimal term lists to describe the cluster prototypes.
Keywords :
genetic algorithms; pattern clustering; text analysis; document clustering; evolutionary multimodal optimization algorithm; explicit latent semantic; frequency variable length vector; genetic niching algorithm; optimal document cluster prototype; self-adaptating operator rates; sparse index; sparse top-K-addition operator; specialized genetic operator; specialized sparse genetic operator; text document; Clustering algorithms; Frequency measurement; Genetics; Indexes; Mathematical model; Prototypes; Vectors; Genetic Clustering; Text Mining;
Conference_Title :
Web Congress (LA-WEB), 2012 Eighth Latin American
Conference_Location :
Cartagena de Indias
Print_ISBN :
978-1-4673-4473-9
DOI :
10.1109/LA-WEB.2012.22