DocumentCode :
3717272
Title :
Genomic analysis with MapReduce
Author :
Wei Yi Liu;Hui-I Hsiao;Shih Yao Dai
Author_Institution :
Data Analytics Technology & Applications Research Institute, Institute for Information Industry, Taipei, Taiwan
fYear :
2015
Firstpage :
1330
Lastpage :
1335
Abstract :
Genomic analysis [1] usually consists of a three-stage pipeline: sequence alignment, data conversion, and advanced analysis. The pipeline must handle hundreds of gigabytes of data and run complex analytics algorithms, so a full-genome analysis traditionally takes a long time to execute (20+ hours). Parallelizing the analytics algorithms is one way to speed up the process. It is not a simple task, however, because it involves complicated splitting and distribution of data and merging of intermediate results. Our objective is to reduce the genomic analysis time to under an hour. To achieve this, we designed and implemented a distributed analysis pipeline that runs in parallel on a Hadoop cluster (physical machines or VM nodes). Because Hadoop already handles job dispatching and work balancing among distributed worker nodes, we need not handle the node failures and load balancing that a traditional distributed computing approach requires. Our major challenge is to run the genomic analysis pipeline effectively with Hadoop MapReduce while ensuring the correctness and quality of the analysis results. This paper discusses the design and implementation of a highly parallelized genomic analysis pipeline. Our preliminary experimental results show that the parallelized pipeline using MapReduce improves analysis time by 447% while maintaining result quality.
Keywords :
"Genomics","Bioinformatics","Pipelines","Algorithm design and analysis","Biological cells","DNA","Sequential analysis"
Publisher :
IEEE
Conference_Title :
Big Data (Big Data), 2015 IEEE International Conference on
Type :
conf
DOI :
10.1109/BigData.2015.7363891
Filename :
7363891