DocumentCode :
3678371
Title :
RE-PAGE: Domain-Specific REplication and PArallel Processing of GEnomic Data
Author :
Mucahid Kutlu;Gagan Agrawal
Author_Institution :
Dept. of Comput. Sci. &
fYear :
2015
Firstpage :
332
Lastpage :
341
Abstract :
As the development of high-throughput, low-cost sequencing technologies is leading to massive volumes of genomic data, new solutions for handling data-intensive applications on parallel platforms are urgently required. In particular, the nature of the processing leads to both load balancing and I/O contention challenges. In this paper, we have developed a novel middleware system, RE-PAGE, which allows parallelization of applications that process genomic data through a simple, high-level API. To address load balancing and I/O contention, the features of the middleware include: 1) use of domain-specific information in the formation of data chunks (which can be of non-uniform sizes), 2) intelligent replication and placement of each chunk on a small number of nodes, and 3) scheduling schemes for achieving load balance when data movement costs outweigh processing costs and the chunks are of non-uniform sizes. We have evaluated our framework using three genomic applications: VarScan, Unified Genotyper, and Coverage Analyzer. We show that our approach leads to better performance than conventional MapReduce scheduling approaches and systems that access data from a centralized store. We also compare against two popular frameworks, Hadoop and GATK, and show that our middleware outperforms both, achieving high parallel efficiency and scalability.
Keywords :
"Genomics","Bioinformatics","Middleware","Load management","Processor scheduling","Sequential analysis"
Publisher :
IEEE
Conference_Title :
Cluster Computing (CLUSTER), 2015 IEEE International Conference on
Type :
conf
DOI :
10.1109/CLUSTER.2015.54
Filename :
7307601