DocumentCode
20910
Title
An Efficient Algorithm for Discovering Motifs in Large DNA Data Sets
Author
Qiang Yu ; Hongwei Huo ; Xiaoyang Chen ; Haitao Guo ; Jeffrey Scott Vitter ; Jun Huan
Author_Institution
Sch. of Comput. Sci. & Technol., Xidian Univ., Xi'an, China
Volume
14
Issue
5
fYear
2015
fDate
Jul-15
Firstpage
535
Lastpage
544
Abstract
Planted (l,d) motif discovery has been used successfully over the past decade to locate transcription factor binding sites in dozens of promoter sequences. However, little work has been done on identifying (l,d) motifs in next-generation sequencing (ChIP-seq) data sets, which contain thousands of input sequences and therefore make accurate identification in reasonable time a new challenge. To meet this need, we propose a new planted (l,d) motif discovery algorithm named MCES, which identifies motifs by mining and combining emerging substrings. Specifically, to handle larger data sets, we design a MapReduce-based strategy that mines emerging substrings in a distributed manner. Experimental results on simulated data show that i) MCES identifies (l,d) motifs efficiently and effectively in thousands to millions of input sequences and runs faster than state-of-the-art (l,d) motif discovery algorithms such as F-motif and TraverStringsR; ii) MCES can identify motifs whose lengths are not known in advance and achieves better identification accuracy than the competing algorithm CisFinder. The validity of MCES is also tested on real data sets. MCES is freely available at http://sites.google.com/site/feqond/mces.
Keywords
DNA; biological techniques; biology computing; molecular biophysics; F-motif; MCES; MapReduce-based strategy; TraverStringsR; large DNA data sets; motif discovery algorithm; Algorithm design and analysis; Clustering algorithms; DNA; Data mining; Dispersion; Nanobioscience; Pulse width modulation; ChIP-seq; MapReduce; emerging substrings; motif discovery
fLanguage
English
Journal_Title
NanoBioscience, IEEE Transactions on
Publisher
IEEE
ISSN
1536-1241
Type
jour
DOI
10.1109/TNB.2015.2421340
Filename
7083766