Regional Information Center for Science and Technology - ScalaBLAST: A Scalable Implementation of BLAST for High-Performance Data-Intensive Bioinformatics Analysis

DocumentCode :

1018422

Title :

ScalaBLAST: A Scalable Implementation of BLAST for High-Performance Data-Intensive Bioinformatics Analysis

Author :

Oehmen, Christopher ; Nieplocha, Jarek

Author_Institution :

Computational Sci. & Math. Div., Pacific Northwest Nat. Lab., Richland, WA

Volume :

Issue :

Year :

2006

Firstpage :

740

Lastpage :

749

Abstract :

Genes in an organism's DNA (genome) have embedded in them information about proteins, which are the molecules that do most of a cell's work. A typical bacterial genome contains on the order of 5,000 genes. Mammalian genomes can contain tens of thousands of genes. For each genome sequenced, the challenge is to identify protein components (proteome) being actively used for a given set of conditions. Fundamentally, sequence alignment is a sequence matching problem focused on unlocking protein information embedded in the genetic code, making it possible to assemble a "tree of life" by comparing new sequences against all sequences from known organisms. But the memory footprint of sequence data is growing more rapidly than per-node core memory. Despite years of research and development, high-performance sequence alignment applications either do not scale well, cannot accommodate very large databases in core, or require special hardware. We have developed a high-performance sequence alignment application, ScalaBLAST, which accommodates very large databases and which scales linearly to as many as thousands of processors on both distributed memory and shared memory architectures, representing a substantial improvement over the current state of the art in high-performance sequence alignment with scaling and portability. ScalaBLAST relies on a collection of techniques - distributing the target database over available memory, multilevel parallelism to exploit concurrency, parallel I/O, and latency hiding through data prefetching - to achieve high performance and scalability. This demonstrated approach of database sharing combined with effective task scheduling should have broad-ranging applications to other informatics-driven sciences.

Keywords :

DNA; biology computing; distributed shared memory systems; genetics; proteins; sequences; storage management; very large databases; ScalaBLAST; bacterial genome; data prefetching; distributed memory; genetic code; high-performance data-intensive bioinformatics analysis; latency hiding; mammalian genome; multilevel parallelism; parallel I/O; protein components; protein information; sequence alignment; sequence matching problem; shared memory architecture; task scheduling; Assembly; Bioinformatics; Data analysis; Databases; Genomics; Microorganisms; BLAST; Global Arrays; High-performance sequence alignment

Language :

English

Journal_Title :

Parallel and Distributed Systems, IEEE Transactions on

Publisher :

IEEE

ISSN :

1045-9219

Type :

jour

DOI :

10.1109/TPDS.2006.112

Filename :

1652938

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1018422