Regional Information Center for Science and Technology - Simultaneous feature selection and unsupervised clustering for gene-expression data in multiobjective optimization framework

DocumentCode :

264988

Title :

Simultaneous feature selection and unsupervised clustering for gene-expression data in multiobjective optimization framework

Author :

Alok, Abhay Kumar ; Kanekar, Neha ; Saha, Sriparna ; Ekbal, Asif

Author_Institution :

Comput. Sci. Eng., Indian Inst. of Technol., Patna, Patna, India

fYear :

2014

fDate :

15-17 Dec. 2014

Firstpage :

Lastpage :

Abstract :

In this paper, the problem of simultaneous feature selection and automatic clustering is formulated as a multi-objective optimization task. Studying the patterns hidden in gene expression data helps to understand the functionality of genes. However, because of the large number of genes and the high complexity of biological networks, sophisticated techniques are required to study the available data, which consist of a large number of measurements. Clustering techniques are generally used to identify natural partitions and detect interesting patterns in the given data as a first step in studying gene expression data. However, not all features present in a data set are necessarily important for clustering. Appropriate selection of features from the full feature set is therefore highly relevant from a clustering point of view. A modern simulated-annealing-based multiobjective optimization technique, AMOSA, is used as the underlying optimization methodology. Here, features and cluster centers are represented in the form of a string. Three optimization criteria are used: i) a function representing the total compactness of the partitioning based on the Euclidean distance, ii) a function representing the total compactness of the partitioning based on the point-symmetry-based distance, and iii) a function counting the number of features. The objective is to optimize the values of the cluster validity indices while increasing the number of features, in order to remove the bias of internal cluster validity indices on dimensionality. The appropriate subset of features, the proper number of clusters, and the proper partitioning are determined using the search capability of AMOSA. To assign cluster labels to all points, a recently introduced distance, the point-symmetry-based distance, is used. The effectiveness of the proposed Fea-GenClustMOO technique is shown for automatically clustering publicly available gene-expression data sets. Results are compared with existing techniques for gene expression data clustering.
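The string encoding and the three criteria named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The dictionary layout of the solution, the function names, the `knear` parameter, and the toy data are all assumptions introduced here. The point symmetry distance follows one commonly cited form (mean distance from the reflected point to its nearest data points, scaled by the Euclidean distance to the center), and plain sums of distances stand in for the cluster validity indices. The AMOSA search itself is not shown.

```python
import math

def euclidean(a, b, mask):
    """Euclidean distance restricted to the features switched on in mask."""
    return math.sqrt(sum((x - y) ** 2 for x, y, m in zip(a, b, mask) if m))

def point_symmetry_distance(x, center, data, mask, knear=2):
    """Assumed form: mean distance from the reflected point 2c - x to its
    knear nearest data points, multiplied by the Euclidean distance to c."""
    reflected = [2 * c - v for v, c in zip(x, center)]
    nearest = sorted(euclidean(reflected, p, mask) for p in data)[:knear]
    return (sum(nearest) / len(nearest)) * euclidean(x, center, mask)

def evaluate(solution, data):
    """Return (Euclidean compactness, symmetry compactness, feature count, labels)."""
    mask, centers = solution["mask"], solution["centers"]
    # Assign each point to the center with the smallest point symmetry distance.
    labels = [
        min(range(len(centers)),
            key=lambda k: point_symmetry_distance(x, centers[k], data, mask))
        for x in data
    ]
    eucl = sum(euclidean(x, centers[k], mask) for x, k in zip(data, labels))
    sym = sum(point_symmetry_distance(x, centers[k], data, mask)
              for x, k in zip(data, labels))
    return eucl, sym, sum(mask), labels

# Toy data (hypothetical): two groups in features 0 and 1, feature 2 is noise.
data = [
    [0.0, 0.0, 5.0], [0.0, 1.0, -3.0], [1.0, 0.0, 2.0], [1.0, 1.0, 9.0],
    [10.0, 10.0, 1.0], [10.0, 11.0, 7.0], [11.0, 10.0, -2.0], [11.0, 11.0, 4.0],
]
# One encoded solution: a binary feature mask plus a list of cluster centers.
solution = {"mask": [1, 1, 0],
            "centers": [[0.5, 0.5, 0.0], [10.5, 10.5, 0.0]]}

eucl, sym, n_features, labels = evaluate(solution, data)
print(labels, n_features, round(eucl, 3), round(sym, 3))
```

A multiobjective optimizer would compare many such solutions, with different masks and different numbers of centers, on these three values at once rather than collapsing them into a single score.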

Keywords :

biology computing; feature selection; genetics; pattern clustering; set theory; simulated annealing; unsupervised learning; AMOSA; Euclidean distance; Fea-GenClustMOO technique; automatic clustering; background optimization methodology; complex biological networks; feature subset; function counting; gene expression data clustering; internal cluster validity indices; optimization criteria; point symmetry based distance; search capability; simulated annealing based multiobjective optimization technique; simultaneous feature selection; unsupervised clustering; Clustering algorithms; Computer science; Euclidean distance; Gene expression; Indexes; Linear programming; Optimization; AMOSA; Cluster validity index; Multiobjective optimization; Silhouette-index; Sym-index; XB-index;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Industrial and Information Systems (ICIIS), 2014 9th International Conference on

Conference_Location :

Gwalior

Print_ISBN :

978-1-4799-6499-4

Type :

conf

DOI :

10.1109/ICIINFS.2014.7036594

Filename :

7036594

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=264988