• DocumentCode
    2341438
  • Title

    Accelerating approximate subsequence search on large protein sequence databases

  • Author

    Yang, Jiong ; Wang, Wei ; Xia, Yi ; Yu, Philip S.

  • Author_Institution
    IBM T. J. Watson Res. Center, USA
  • fYear
    2002
  • fDate
    2002
  • Firstpage
    207
  • Lastpage
    216
  • Abstract
    In this paper, we study the problem on how to build a persistent index structure for protein sequences to support approximate match. The suffix tree has been proposed as a solution to index sequence database and has been deployed on organizing DNA sequences (Hunt et al. (2001)). Unfortunately, it suffers from the problem of "memory bottleneck" that prevents it from being applied efficiently to a large database. The performance even degrades further for protein database due to a larger fanout at each node. Here, we employ an indexing structure, called BASS-tree, to support approximate match in sublinear time on a large protein database. We call this indexing method the sequence approximate match index method. The search of approximate matches can be properly directed to the portion in the database with a high potential of matching quickly. It is demonstrated in our experiments that the potential performance improvement is in an order of magnitude over alternative methods such as the BLAST algorithm and the suffix tree.
  • Keywords
    DNA; biology computing; database indexing; database management systems; pattern matching; proteins; search problems; tree data structures; BASS-tree; DNA sequences; approximate subsequence search; index structure; large protein sequence; molecular biology; protein database; sequence approximate match index method; suffix tree; Acceleration; Bioinformatics; DNA; Databases; Degradation; Indexes; Indexing; Organizing; Protein sequence; Sequences;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Bioinformatics Conference, 2002. Proceedings. IEEE Computer Society
  • Print_ISBN
    0-7695-1653-X
  • Type

    conf

  • DOI
    10.1109/CSB.2002.1039343
  • Filename
    1039343