DocumentCode :
264988
Title :
Simultaneous feature selection and unsupervised clustering for gene-expression data in multiobjective optimization framework
Author :
Alok, Abhay Kumar ; Kanekar, Neha ; Saha, Sriparna ; Ekbal, Asif
Author_Institution :
Comput. Sci. Eng., Indian Inst. of Technol., Patna, Patna, India
fYear :
2014
fDate :
15-17 Dec. 2014
Firstpage :
1
Lastpage :
6
Abstract :
In this paper, the problem of simultaneous feature selection and automatic clustering is formulated as a multiobjective optimization task. Studying the patterns hidden in gene-expression data helps in understanding the functionality of genes. However, because of the large number of genes and the high complexity of biological networks, sophisticated techniques are required to study available data consisting of a large number of measurements. As a first step in studying gene-expression data, clustering techniques are generally used to identify natural partitionings and detect interesting patterns in the given data. Not all features present in a data set may be important for clustering, however, so appropriate selection of features from the full feature set is highly relevant from a clustering point of view. A modern simulated annealing based multiobjective optimization technique, namely AMOSA, is utilized as the underlying optimization methodology. Features and cluster centers are represented in the form of a string. Three optimization criteria are utilized: i) a function representing the total compactness of the partitioning based on the Euclidean distance, ii) a function representing the total compactness of the partitioning based on the point symmetry based distance, and iii) a function counting the number of features. The objective is to optimize the values of the cluster validity indices while increasing the number of features, in order to remove the bias of internal cluster validity indices with respect to dimensionality. The appropriate subset of features, the proper number of clusters, and the proper partitioning are determined using the search capability of AMOSA. To assign cluster labels to all points, a recently introduced distance, namely the point symmetry based distance, is utilized. The effectiveness of the proposed Fea-GenClustMOO technique is shown for automatically clustering publicly available gene-expression data sets. Results are compared with existing techniques for gene-expression data clustering.
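The three criteria named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the encoding (a 0/1 feature mask plus a list of cluster centers), the function names, and the use of raw distance sums in place of the cluster validity indices are all assumptions made for illustration. The point symmetry based distance is written in a commonly cited form (reflect the point through the center, average the distances from the reflection to its knear nearest data points, then scale by the Euclidean distance to the center); the record itself does not define it, so that form and the default knear=2 are assumptions as well.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def project(point, mask):
    """Keep only the coordinates whose mask bit is 1 (the selected features)."""
    return [v for v, keep in zip(point, mask) if keep]


def ps_distance(x, center, data, knear=2):
    """Point symmetry based distance, in an ASSUMED form (not taken from the record).

    Reflect x through the center, average the distances from the reflected
    point to its knear nearest data points, and scale by d_e(x, center).
    """
    reflected = [2 * c - v for v, c in zip(x, center)]
    nearest = sorted(euclidean(reflected, p) for p in data)[:knear]
    return (sum(nearest) / len(nearest)) * euclidean(x, center)


def evaluate(mask, centers, data, knear=2):
    """Return (Euclidean compactness, symmetry compactness, feature count).

    Hypothetical stand-ins for the three objectives: raw sums are used here
    instead of the validity indices optimized in the paper.
    """
    reduced = [project(p, mask) for p in data]
    reduced_centers = [project(c, mask) for c in centers]
    eucl_total = 0.0
    sym_total = 0.0
    for p in reduced:
        # Label each point by the smallest point symmetry based distance.
        dists = [ps_distance(p, c, reduced, knear) for c in reduced_centers]
        k = dists.index(min(dists))
        sym_total += dists[k]
        eucl_total += euclidean(p, reduced_centers[k])
    return eucl_total, sym_total, sum(mask)


# Toy data: the first two features form two symmetric groups, the third is noise.
data = [[0, 0, 5], [0, 2, 9], [10, 0, 1], [10, 2, 7]]
mask = [1, 1, 0]                      # select features 0 and 1 only
centers = [[0, 1, 0], [10, 1, 0]]     # one candidate center per cluster
print(evaluate(mask, centers, data))  # (4.0, 4.0, 2)
```

In a search such as the one the abstract describes, the first two values would be treated as quantities to make small and the feature count as a quantity to make large, with the optimizer proposing new masks and centers; that outer loop is omitted here.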
Keywords :
biology computing; feature selection; genetics; pattern clustering; set theory; simulated annealing; unsupervised learning; AMOSA; Euclidean distance; Fea-GenClustMOO technique; automatic clustering; background optimization methodology; complex biological networks; feature subset; function counting; gene expression data clustering; internal cluster validity indices; optimization criteria; point symmetry based distance; search capability; simulated annealing based multiobjective optimization technique; simultaneous feature selection; unsupervised clustering; Clustering algorithms; Computer science; Euclidean distance; Gene expression; Indexes; Linear programming; Optimization; AMOSA; Cluster validity index; Multiobjective optimization; Silhouette-index; Sym-index; XB-index;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Industrial and Information Systems (ICIIS), 2014 9th International Conference on
Conference_Location :
Gwalior
Print_ISBN :
978-1-4799-6499-4
Type :
conf
DOI :
10.1109/ICIINFS.2014.7036594
Filename :
7036594