• DocumentCode
    34455
  • Title

    An Efficient Search Algorithm for Finding Genomic-Range Overlaps Based on the Maximum Range Length

  • Author

    Ho-Sik Seok ; Taemin Song ; Sek Won Kong ; Kyu-Baek Hwang

  • Author_Institution
    Sch. of Comput. Sci. & Eng., Soongsil Univ., Seoul, South Korea
  • Volume
    12
  • Issue
    4
  • fYear
    2015
  • fDate
    July-Aug. 1 2015
  • Firstpage
    778
  • Lastpage
    784
  • Abstract
    Efficient search algorithms for finding genomic-range overlaps are essential for various bioinformatics applications. A majority of fast algorithms for searching the overlaps between a query range (e.g., a genomic variant) and a set of N reference ranges (e.g., exons) has time complexity of O(k + logN), where kdenotes a term related to the length and location of the reference ranges. Here, we present a simple but efficient algorithm that reduces k, based on the maximum reference range length. Specifically, for a given query range and the maximum reference range length, the proposed method divides the reference range set into three subsets: always, potentially, and never overlapping. Therefore, search effort can be reduced by excluding never overlapping subset. We demonstrate that the running time of the proposed algorithm is proportional to potentially overlapping subset size, that is proportional to the maximum reference range length if all the other conditions are the same. Moreover, an implementation of our algorithm was 13.8 to 30.0 percent faster than one of the fastest range search methods available when tested on various genomic-range data sets. The proposed algorithm has been incorporated into a disease-linked variant prioritization pipeline for WGS (http://gnome.tchlab.org) and its implementation is available at http://ml.ssu.ac.kr/gSearch.
  • Keywords
    bioinformatics; diseases; genomics; N reference ranges; O(k+logN) complexity; bioinformatics applications; disease-linked variant prioritization pipeline; efficient search algorithm; exons; genomic variant; genomic-range data sets; genomic-range overlaps; maximum range length; potentially overlapping subset size; query range; Algorithm design and analysis; Bioinformatics; Computational biology; Genomics; Sequential analysis; Sorting; Time complexity; Genome analysis; RNA sequencing; next-generation sequencing; range search; variant annotation; variant prioritization; whole-genome sequencing;
  • fLanguage
    English
  • Journal_Title
    Computational Biology and Bioinformatics, IEEE/ACM Transactions on
  • Publisher
    ieee
  • ISSN
    1545-5963
  • Type

    jour

  • DOI
    10.1109/TCBB.2014.2369042
  • Filename
    6951414