• DocumentCode
    1784886
  • Title

    An efficient motif finding algorithm for large DNA data sets

  • Author

    Qiang Yu ; Hongwei Huo ; Xiaoyang Chen ; Haitao Guo ; Vitter, Jeffrey Scott ; Jun Huan

  • Author_Institution
    Sch. of Comput. Sci. & Technol., Xidian Univ., Xi'an, China
  • fYear
    2014
  • fDate
    2-5 Nov. 2014
  • Firstpage
    397
  • Lastpage
    402
  • Abstract
    Planted (l, d) motif discovery has been used successfully over the past decade to locate transcription factor binding sites in sets of dozens of promoter sequences. However, little work has addressed identifying (l, d) motifs in next-generation sequencing (ChIP-seq) data sets, which contain thousands of input sequences and therefore make accurate identification within a reasonable time a new challenge. To meet this need, we propose a new planted (l, d) motif discovery algorithm named MCES, which identifies motifs by mining and combining emerging substrings. Specifically, to handle larger data sets, we design a MapReduce-based strategy to mine emerging substrings in a distributed manner. Experimental results on simulated data show that i) MCES identifies (l, d) motifs efficiently and effectively in thousands to millions of input sequences and runs faster than state-of-the-art (l, d) motif discovery algorithms such as F-motif and TraverStringsR; ii) MCES can identify motifs whose lengths are not known in advance and achieves better identification accuracy than the competing algorithm CisFinder. The validity of MCES is also tested on real data sets.
  • Keywords
    DNA; bioinformatics; data mining; molecular biophysics; molecular configurations; ChIP-seq; DNA data sets; F-motif; MCES; MapReduce-based strategy; TraverStringsR; competing algorithm CisFinder; data mining; efficient motif finding algorithm; identification accuracy; input sequences; next-generation sequencing data sets; planted (l,d) motif discovery algorithm; promoter sequences; simulated data; state-of-the-art (l,d) motif discovery algorithms; transcription factor binding sites; Accuracy; Algorithm design and analysis; Clustering algorithms; DNA; Data mining; Dispersion; Pulse width modulation; ChIP-seq; MapReduce; Motif discovery; emerging substrings;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on
  • Conference_Location
    Belfast
  • Type

    conf

  • DOI
    10.1109/BIBM.2014.6999191
  • Filename
    6999191
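
The abstract refers to planted (l, d) motifs without defining them. In the classic formulation, a string of length l is an (l, d) motif if every input sequence contains a window of length l that differs from it in at most d positions. The sketch below is a brute-force check of that definition only, added for illustration; it is not the MCES algorithm, and the function names `hamming`, `occurs_within`, and `is_ld_motif` are hypothetical.

```python
from typing import Iterable


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def occurs_within(motif: str, sequence: str, d: int) -> bool:
    """True if some window of len(motif) in `sequence` has at most d mismatches."""
    l = len(motif)
    return any(
        hamming(motif, sequence[i:i + l]) <= d
        for i in range(len(sequence) - l + 1)
    )


def is_ld_motif(motif: str, sequences: Iterable[str], d: int) -> bool:
    """Classic (l, d) condition: an occurrence is required in every sequence."""
    return all(occurs_within(motif, s, d) for s in sequences)


# Toy data, invented for illustration (not taken from the paper).
toy_sequences = ["TTACGTAA", "GGACCTCC", "ACGAAAAA"]
found_with_one_mismatch = is_ld_motif("ACGT", toy_sequences, d=1)
found_exactly = is_ld_motif("ACGT", toy_sequences, d=0)
```

On the toy data, "ACGT" satisfies the condition with d = 1 but not with d = 0, since only the first sequence contains it exactly. Each check costs time proportional to l times the total input length, which is one reason exhaustive approaches become impractical on the thousands of sequences the abstract mentions.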