Title :
Optimizing a MapReduce module of preprocessing high-throughput DNA sequencing data
Author :
Wei-Chun Chung ; Yu-Jung Chang ; Chien-Chih Chen ; Der-Tsai Lee ; Jan-Ming Ho
Author_Institution :
Res. Center for Inf. Technol. Innovation, Acad. Sinica Taipei, Taipei, Taiwan
Abstract :
The MapReduce framework has become the de facto choice for big data analysis in a variety of applications. In the MapReduce programming model, computation is distributed across a cluster of computing nodes that run in parallel. The performance of a MapReduce application is thus affected by the system and middleware, the characteristics of the data, and the design and implementation of the algorithms. In this study, we focus on performance optimization of a MapReduce application, CloudRS, which tackles the problem of detecting and removing errors in next-generation sequencing de novo genomic data. We present three strategies for communication between the Map() and Reduce() functions: content-exchange, content-grouping, and index-only. The three strategies differ in the way messages are exchanged between the two functions. We also present experimental results comparing the performance of the three strategies.
Keywords :
biology computing; data analysis; middleware; molecular biophysics; parallel programming; CloudRS; MapReduce framework; MapReduce module; MapReduce programming model; big data analysis; content-exchange strategy; content-grouping strategy; data preprocessing; high-throughput DNA sequencing data; index-only strategy; next-generation sequencing data; performance optimization; Bioinformatics; Data handling; Data storage systems; Genomics; Information management; Optimization; Sequential analysis; error correction; genome assembly; mapreduce; next-generation sequencing; optimization
Conference_Title :
Big Data, 2013 IEEE International Conference on
Conference_Location :
Silicon Valley, CA
DOI :
10.1109/BigData.2013.6691694