  • DocumentCode
    3334513
  • Title
    Approximate similarity search in genomic sequence databases using landmark-guided embedding
  • Author
    Sacan, Ahmet; Toroslu, I. Hakki

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH
  • fYear
    2008
  • fDate
    7-12 April 2008
  • Firstpage
    338
  • Lastpage
    345
  • Abstract
    Similarity search in sequence databases is of paramount importance in bioinformatics research. As genomic databases grow, similarity search of proteins in these databases becomes a bottleneck in large-scale studies, calling for more efficient methods of content-based retrieval. In this study, we present a metric-preserving, landmark-guided embedding approach that represents sequences in the vector domain to allow efficient indexing and similarity search. We analyze various properties of the embedding and show that the approximation achieved by the embedded representation is sufficient to yield biologically relevant results. The approximate representation is shown to provide several orders of magnitude speed-up in similarity search compared to the exact representation, while maintaining comparable search accuracy.
  • Keywords
    biology computing; content-based retrieval; database indexing; genetics; proteins; sequences; approximate similarity search; bioinformatics research; genomic sequence database; indexing; landmark-guided embedding approach; vector domain; Bioinformatics; Computer science; Data engineering; Databases; Genomics; Indexing; Large-scale systems; Matrices; Proteins; Sequences
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    IEEE 24th International Conference on Data Engineering Workshop, 2008 (ICDEW 2008)
  • Conference_Location
    Cancun
  • Print_ISBN
    978-1-4244-2161-9
  • Electronic_ISBN
    978-1-4244-2162-6
  • Type
    conf
  • DOI
    10.1109/ICDEW.2008.4498343
  • Filename
    4498343