DocumentCode
580101
Title
Indexing genomic sequences on the IBM Blue Gene
Author
Ghoting, A.; Makarychev, Konstantin
Author_Institution
IBM T. J. Watson Res. Center, Yorktown Heights, NY, USA
fYear
2009
fDate
14-20 Nov. 2009
Firstpage
1
Lastpage
11
Abstract
With advances in sequencing technology and through aggressive sequencing efforts, DNA sequence data sets have been growing at a rapid pace. To gain from these advances, it is important to provide life science researchers with the ability to process and query large sequence data sets. For the past three decades, the suffix tree has served as a fundamental data structure in processing sequential data sets. However, tree construction times on large data sets have been excessive. While parallel suffix tree construction is an obvious solution to reduce execution times, poor locality of reference has limited parallel performance. In this paper, we show that through careful parallel algorithm design, this limitation can be removed, allowing tree construction to scale to massively parallel systems like the IBM Blue Gene. We demonstrate that the entire human genome can be indexed on 1024 processors in under 15 minutes.
Keywords
IBM computers; biocomputing; database indexing; genomics; parallel algorithms; query processing; tree data structures; DNA sequence data set; IBM Blue Gene; data structure; genomic sequence indexing; human genome; large sequence data set query; life science; parallel algorithm design; parallel performance; parallel suffix tree construction; parallel system; sequencing technology; sequential data set processing;
fLanguage
English
Publisher
IEEE
Conference_Titel
High Performance Computing Networking, Storage and Analysis, Proceedings of the Conference on
Conference_Location
Portland, OR
Type
conf
DOI
10.1145/1654059.1654122
Filename
6375549