Title :
A Cluster Refinement Algorithm for Motif Discovery
Author :
Li, Gang ; Chan, Tak-Ming ; Leung, Kwong-Sak ; Lee, Kin-Hong
Author_Institution :
Dept. of Comput. Sci. & Eng., Chinese Univ. of Hong Kong, Shatin, China
Abstract :
Finding Transcription Factor Binding Sites, i.e., motif discovery, is crucial for understanding the gene regulatory relationship. Motifs are weakly conserved and motif discovery is an NP-hard problem. We propose a new approach called Cluster Refinement Algorithm for Motif Discovery (CRMD). CRMD employs a flexible statistical motif model allowing a variable number of motifs and motif instances. CRMD first uses a novel entropy-based clustering to find complete and good starting candidate motifs from the DNA sequences. CRMD then employs an effective greedy refinement to search for optimal motifs from the candidate motifs. The refinement is fast, and it changes the number of motif instances based on the adaptive thresholds. The performance of CRMD is further enhanced if the problem has one occurrence of motif instance per sequence. Using an appropriate similarity test of motifs, CRMD is also able to find multiple motifs. CRMD has been tested extensively on synthetic and real data sets. The experimental results verify that CRMD usually outperforms four other state-of-the-art algorithms in terms of the qualities of the solutions with competitive computing time. It finds a good balance between finding true motif instances and screening false motif instances, and is robust on problems of various levels of difficulty.
Keywords :
DNA; biology computing; computational complexity; entropy; genetics; genomics; greedy algorithms; statistical analysis; CRMD; Cluster Refinement Algorithm for Motif Discovery; DNA sequences; NP-hard problem; adaptive thresholds; entropy-based clustering; flexible statistical motif model; gene regulatory relationship; greedy refinement; transcription factor binding sites; Biology computing; Clustering algorithms; DNA; Databases; Evolution (biology); NP-hard problem; Organisms; Robustness; Sequences; Testing; Transcription factor binding site; motif discovery.; Algorithms; Binding Sites; Cluster Analysis; Computational Biology; Entropy; Regulatory Elements, Transcriptional; Sequence Analysis, DNA; Transcription Factors;
Journal_Title :
Computational Biology and Bioinformatics, IEEE/ACM Transactions on
DOI :
10.1109/TCBB.2009.25