DocumentCode
52830
Title
Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis
Author
Jaskowiak, Pablo A. ; Campello, Ricardo J. G. B. ; Costa, Ivan G.
Author_Institution
Inst. of Math. & Comput. Sci., Univ. of Sao Paulo, Sao Carlos, Brazil
Volume
10
Issue
4
fYear
2013
fDate
July-Aug. 2013
Firstpage
845
Lastpage
857
Abstract
Cluster analysis is usually the first step adopted to unveil information from gene expression microarray data. Besides selecting a clustering algorithm, choosing an appropriate proximity measure (similarity or distance) is of great importance to achieve satisfactory clustering results. Nevertheless, to date, there are no comprehensive guidelines on how to choose proximity measures for clustering microarray data. Pearson correlation is the most used proximity measure, whereas the characteristics of other measures remain unexplored. In this paper, we investigate the choice of proximity measures for the clustering of microarray data by evaluating the performance of 16 proximity measures on 52 data sets from time course and cancer experiments. Our results indicate that measures rarely employed in the gene expression literature can provide better results than commonly employed ones, such as Pearson, Spearman, and Euclidean distance. Given that different measures stood out in the time course and cancer data evaluations, the choice of measure should be specific to each scenario. To evaluate measures on time course data, we preprocessed and compiled 17 data sets from the microarray literature into a benchmark and introduce a new methodology, called Intrinsic Biological Separation Ability (IBSA). Both can be employed in future research to assess the effectiveness of new measures for gene time course data.
Keywords
bioinformatics; cancer; genetics; pattern clustering; Intrinsic Biological Separation Ability; Pearson coefficient; Spearman coefficient; cluster analysis; Euclidean distance; gene expression microarray data clustering; proximity measures; time course data; validation methodology; Clustering algorithms; Correlation; Equations; Gene expression; Time complexity; Proximity measure; clustering; correlation coefficient; distance; similarity; time course
fLanguage
English
Journal_Title
Computational Biology and Bioinformatics, IEEE/ACM Transactions on
Publisher
IEEE
ISSN
1545-5963
Type
jour
DOI
10.1109/TCBB.2013.9
Filename
6461019