Title :
Text Clustering by 2D Cellular Automata Based on the N-Grams
Author :
Hamou, Reda Mohamed ; Lehireche, Ahmed ; Lokbani, Ahmed Chaouki ; Rahmani, Mohamed
Author_Institution :
EEDIS Lab., Evolutionary Eng. & Distrib. Inf. Syst. Lab., Univ. Dr Tahar MOULAY of Saida, Saida, Algeria
Abstract :
In this article we present a 2D cellular automaton (Class_AC) that addresses a text mining problem, namely unsupervised classification (clustering). Before running the cellular automaton, we vectorized the data by indexing textual documents from the Reuters-21578 collection using the N-gram approach. The proposed cellular automaton is a planar grid of cells, and the neighborhood of each cell follows from this planar structure. Each cell takes one of four states, and three transition functions drive the evolution of the automaton. The results show that this virtual parallel-computing machine (Class_AC) effectively groups documents that are similar to within a threshold. Section 1 gives an introduction, Section 2 presents the representation of texts based on N-grams, Section 3 describes the cellular automaton for clustering, Section 4 reports the experiments and comparison results, and Section 5 gives conclusions and perspectives.
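The N-gram indexing step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the value of N, the normalization, or the weighting scheme, so everything below is an assumption made for illustration: character N-grams, lowercase text, whitespace collapsed to "_", and raw counts as weights. The names char_ngrams and ngram_vectors are hypothetical and do not come from the paper.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a text.

    Assumption (not from the paper): text is lowercased and each run of
    whitespace is replaced by a single '_' before the window slides.
    """
    normalized = "_".join(text.lower().split())
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


def ngram_vectors(documents, n=3):
    """Build raw-count n-gram vectors over a shared, sorted vocabulary.

    Assumption (not from the paper): weights are plain occurrence counts.
    """
    counts = [Counter(char_ngrams(doc, n)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for n-grams a document does not contain.
    vectors = [[c[gram] for gram in vocabulary] for c in counts]
    return vocabulary, vectors


if __name__ == "__main__":
    vocab, vecs = ngram_vectors(["text mining", "text clustering"], n=3)
    print(len(vocab), "distinct 3-grams")
    print(vecs)
```

Vectors of this kind could then be compared with any similarity measure before documents are placed on the grid; which measure and threshold the authors use is not specified in the abstract.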
Keywords :
cellular automata; indexing; parallel processing; pattern clustering; text analysis; 2D cellular automata; N-grams; data indexing; text clustering; textual documents; unsupervised classification; virtual machine parallel computing; Automata; Biological system modeling; Classification algorithms; Entropy; Laboratories; Support vector machine classification; Text mining; Data classification; biomimetic methods; clustering and segmentation; data mining
Conference_Title :
Cryptography and Network Security, Data Mining and Knowledge Discovery, E-Commerce & Its Applications and Embedded Systems (CDEE), 2010 First ACIS International Symposium on
Conference_Location :
Qinhuangdao
Print_ISBN :
978-1-4244-9595-5
DOI :
10.1109/CDEE.2010.60