• DocumentCode
    1200020
  • Title
    Parallel pattern identification in biological sequences on clusters
  • Author
    Huang, Chun-Hsi ; Rajasekaran, Sanguthevar
  • Author_Institution
    Dept. of Comput. Sci. & Eng., Univ. of Connecticut, Storrs, CT, USA
  • Volume
    2
  • Issue
    1
  • fYear
    2003
  • fDate
    3/1/2003
  • Firstpage
    29
  • Lastpage
    34
  • Abstract
    Tandem repeats are ubiquitous sequence features in both prokaryotic and eukaryotic genomes. They are known to cause several inherited neurological diseases in humans. Identifying these patterns is a highly computation-intensive process. Previous parallel implementations use straightforward domain decomposition based on existing sequential algorithms and rely on parallel machines with low-latency interconnection networks and fast hardware support for processor synchronization. Our research exploits the superior cost effectiveness and flexibility of low-cost clusters to speed up biological computations by designing communication-efficient parallel algorithms for pattern identification. This paper presents a low communication-overhead parallel algorithm for pattern identification in biological sequences. Given a biological sequence of length n and a pattern of length m, we arrive at an algorithm with five computation/communication phases, each requiring O(n) computation time and only O(p) message units. The low communication overhead of the algorithm is essential for achieving reasonable speedups on clusters, where interprocessor communication latency is usually higher.
  • Keywords
    biology computing; diseases; genetics; molecular biophysics; parallel algorithms; pattern matching; sequences; biological computations; biological sequences; clusters; communication-efficient parallel algorithms; cost effectiveness; eukaryotic genomes; five computation/communication phases; flexibility; highly computation-intensive process; humans; inherited neurological diseases; interprocessor communication latency; low communication-overhead parallel algorithm; low-cost clusters; message units; parallel pattern identification; prokaryotic genomes; tandem repeats; ubiquitous sequence features; Bioinformatics; Biology computing; Clustering algorithms; Diseases; Genomics; Humans; Multiprocessor interconnection networks; Parallel algorithms; Parallel machines; Pervasive computing; Algorithms; Cluster Analysis; Computer Communication Networks; Computing Methodologies; Gene Expression Profiling; Pattern Recognition, Automated; Sequence Alignment; Sequence Analysis, DNA; Tandem Repeat Sequences;
  • fLanguage
    English
  • Journal_Title
    NanoBioscience, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1536-1241
  • Type
    jour
  • DOI
    10.1109/TNB.2003.810165
  • Filename
    1198675