DocumentCode :
14777
Title :
COSAC: A Framework for Combinatorial Statistical Analysis on Cloud
Author :
Zhengkui Wang ; Deepak Agrawal ; Kian-Lee Tan
Author_Institution :
NUS Graduate School for Integrative Sciences and Engineering, National University of Singapore, Singapore
Volume :
25
Issue :
9
fYear :
2013
fDate :
Sept. 2013
Firstpage :
2010
Lastpage :
2023
Abstract :
In many scientific applications, it is critical to determine whether there is a relationship among a combination of objects. The strength of such an association is typically computed using statistical measures. To avoid missing any important association, it is not uncommon to exhaustively enumerate all possible combinations of a certain size. However, discovering significant associations among hundreds of thousands or even millions of objects is a computationally intensive job that typically takes days, if not weeks, to complete. We are, therefore, motivated to provide efficient and practical techniques that speed up the processing by exploiting parallelism. In this paper, we propose a framework, COSAC, for such combinatorial statistical analysis of large-scale data sets on a MapReduce-based cloud computing platform. COSAC operates in two key phases: 1) in the distribution phase, a novel load balancing scheme distributes the combination enumeration tasks across the processing units; 2) in the statistical analysis phase, each unit optimizes the processing of its allocated combinations by salvaging computations that can be reused. COSAC also supports a more practical scenario, where only a selected subset of objects needs to be analyzed against all the objects. As a representative application, we developed COSAC to find combinations of Single Nucleotide Polymorphisms (SNPs) that may interact to cause diseases. We have evaluated our framework on a cluster of more than 40 nodes. The experimental results show that our framework is computationally practical, efficient, scalable, and flexible.
Keywords :
DNA; biology computing; cloud computing; combinatorial mathematics; diseases; parallel processing; statistical analysis; COSAC; MapReduce-based cloud computing platform; SNP; combination enumeration tasks; combinatorial statistical analysis framework; distribution phase; large-scale data sets; load balancing scheme; parallelism; processing units; scientific applications; single nucleotide polymorphisms; statistical analysis phase; statistical measures; Data engineering; Indexes; Knowledge engineering; Optimization; Testing; Combinatorial statistical analysis; MapReduce; association mining; parallel object combination enumeration
fLanguage :
English
Journal_Title :
IEEE Transactions on Knowledge and Data Engineering
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2012.113
Filename :
6205755