DocumentCode
1200020
Title
Parallel pattern identification in biological sequences on clusters
Author
Huang, Chun-Hsi ; Rajasekaran, Sanguthevar
Author_Institution
Dept. of Comput. Sci. & Eng., Univ. of Connecticut, Storrs, CT, USA
Volume
2
Issue
1
fYear
2003
fDate
3/1/2003
Firstpage
29
Lastpage
34
Abstract
Tandem repeats are ubiquitous sequence features in both prokaryotic and eukaryotic genomes. They are known to cause several inherited neurological diseases in humans. Identifying these patterns is a highly computation-intensive process. Previous parallel implementations use straightforward domain decomposition based on existing sequential algorithms and rely on parallel machines with low-latency interconnection networks and fast hardware support for processor synchronization. Our research exploits the superior cost effectiveness and flexibility of low-cost clusters to speed up biological computations by designing communication-efficient parallel algorithms for pattern identification. This paper presents a low communication-overhead parallel algorithm for pattern identification in biological sequences. Given a biological sequence of length n and a pattern of length m, we present an algorithm with five computation/communication phases, each requiring O(n) computation time and only O(p) message units. The low communication overhead of the algorithm is essential in achieving reasonable speedups on clusters, where the interprocessor communication latency is usually higher.
Keywords
biology computing; diseases; genetics; molecular biophysics; parallel algorithms; pattern matching; sequences; biological computations; biological sequences; clusters; communication-efficient parallel algorithms; cost effectiveness; eukaryotic genomes; five computation/communication phases; flexibility; highly computation-intensive process; humans; inherited neurological diseases; interprocessor communication latency; low communication-overhead parallel algorithm; low-cost clusters; message units; parallel pattern identification; prokaryotic genomes; tandem repeats; ubiquitous sequence features; Bioinformatics; Biology computing; Clustering algorithms; Diseases; Genomics; Humans; Multiprocessor interconnection networks; Parallel algorithms; Parallel machines; Pervasive computing; Algorithms; Cluster Analysis; Computer Communication Networks; Computing Methodologies; Gene Expression Profiling; Pattern Recognition, Automated; Sequence Alignment; Sequence Analysis, DNA; Tandem Repeat Sequences;
fLanguage
English
Journal_Title
NanoBioscience, IEEE Transactions on
Publisher
IEEE
ISSN
1536-1241
Type
jour
DOI
10.1109/TNB.2003.810165
Filename
1198675