مرکز منطقه ای اطلاع رساني علوم و فناوري - A Simple but Powerful Heuristic Method for Accelerating <formula formulatype="inline"> <img src="/images/tex/348.gif" alt="k"> </formula> -Means Clustering of Large-Scale Data in Life Science

DocumentCode :

1765025

Title :

A Simple but Powerful Heuristic Method for Accelerating $k$ -Means Clustering of Large-Scale Data in Life Science

Author :

Ichikawa, Kazuhisa ; Morishita, S.

Author_Institution :

Dept. of Comput. Biol., Univ. of Tokyo, Kashiwa, Japan

Volume :

Issue :

fYear :

2014

fDate :

July-Aug. 2014

Firstpage :

681

Lastpage :

692

Abstract :

K-means clustering has been widely used to gain insight into biological systems from large-scale life science data. To quantify the similarities among biological data sets, Pearson correlation distance and standardized Euclidean distance are used most frequently; however, optimization methods have been largely unexplored. These two distance measurements are equivalent in the sense that they yield the same k-means clustering result for identical sets of k initial centroids. Thus, an efficient algorithm used for one is applicable to the other. Several optimization methods are available for the Euclidean distance and can be used for processing the standardized Euclidean distance; however, they are not customized for this context. We instead approached the problem by studying the properties of the Pearson correlation distance, and we invented a simple but powerful heuristic method for markedly pruning unnecessary computation while retaining the final solution. Tests using real biological data sets with 50-60K vectors of dimensions 10-2001 (~400 MB in size) demonstrated marked reduction in computation time for k = 10-500 in comparison with other state-of-the-art pruning methods such as Elkan´s and Hamerly´s algorithms. The BoostKCP software is available at http://mlab.cb.k.u-tokyo.ac.jp/~ichikawa/boostKCP/.

Keywords :

bioinformatics; data mining; optimisation; pattern clustering; BoostKCP software; Elkan algorithms; Hamerly algorithms; Pearson correlation distance; accelerating k-means clustering; biological systems; k-initial centroids; large-scale life science data; optimization methods; powerful heuristic method; real biological data sets; standardized Euclidean distance; Bioinformatics; Clustering algorithms; Computational biology; Correlation; Correlation coefficient; Euclidean distance; Bioinformatics; clustering; mining methods and algorithms; optimization;

fLanguage :

English

Journal_Title :

Computational Biology and Bioinformatics, IEEE/ACM Transactions on

Publisher :

ieee

ISSN :

1545-5963

Type :

jour

DOI :

10.1109/TCBB.2014.2306200

Filename :

6739991

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1765025