DocumentCode :
2457813
Title :
Predicting Approximate Protein-DNA Binding Cores Using Association Rule Mining
Author :
Wong, Po-Yuen ; Chan, Tak-Ming ; Wong, Man-Hon ; Leung, Kwong-Sak
Author_Institution :
Dept. of Comput. Sci. & Eng., Chinese Univ. of Hong Kong, Hong Kong, China
fYear :
2012
fDate :
1-5 April 2012
Firstpage :
965
Lastpage :
976
Abstract :
The study of protein-DNA binding between transcription factors (TFs) and transcription factor binding sites (TFBSs) is an important bioinformatics topic. High-resolution (length < 10) TF-TFBS binding cores are discovered by expensive and time-consuming 3D structure experiments. Recent association rule mining approaches on low-resolution binding sequences (TF length > 490) have shown promise in identifying accurate binding cores without using any 3D structures. While the current association rule mining method for this problem addresses exact sequences only, the most recent ad hoc method for approximation does not establish any formal model and is limited by experimentally known patterns. As biological mutations are common, it is desirable to formally extend the exact model into an approximate one. In this paper, we formalize the problem of mining approximate protein-DNA association rules from sequence data and propose a novel, efficient algorithm to predict protein-DNA binding cores. Our two-phase algorithm first constructs two compact intermediate structures called the frequent sequence tree (FS-Tree) and the frequent sequence class tree (FSCTree). Approximate association rules are then efficiently generated from these structures, and bioinformatics concepts (position weight matrix and information content) are further employed to prune meaningless rules. Experimental results on real data show the performance and applicability of the proposed algorithm.
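The abstract names the position weight matrix and information content as the concepts used for pruning but does not give the criterion itself. As a rough illustration of those two standard concepts only, the following sketch builds a position weight matrix from equal-length aligned DNA sequences and scores each column by its information content. The function names, the uniform background distribution, and the example sequences are assumptions made for illustration; they are not taken from the paper.

```python
import math
from collections import Counter

DNA_ALPHABET = "ACGT"  # assumed alphabet for the DNA side of a rule


def position_weight_matrix(sequences, alphabet=DNA_ALPHABET):
    """Return per-position symbol frequencies for equal-length sequences."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    pwm = []
    for i in range(length):
        counts = Counter(seq[i] for seq in sequences)
        pwm.append({sym: counts[sym] / len(sequences) for sym in alphabet})
    return pwm


def information_content(column):
    """Information content in bits of one column, assuming a uniform background.

    Computed as log2(alphabet size) minus the Shannon entropy of the column.
    """
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return math.log2(len(column)) - entropy


def total_information_content(pwm):
    """Sum of the per-column information content."""
    return sum(information_content(column) for column in pwm)


# Hypothetical approximate matches grouped under one candidate pattern.
example = ["ACGT", "ACGA", "ACGT", "ACGC"]
pwm = position_weight_matrix(example)
per_column = [information_content(column) for column in pwm]
```

A fully conserved DNA column scores 2 bits and a uniformly random one scores 0, so a pruning step of the kind the abstract describes could plausibly discard patterns whose total score falls below some threshold; the choice of threshold is left open here.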
Keywords :
DNA; bioinformatics; data mining; matrix algebra; proteins; trees (mathematics); 3D structure experiment; FS-Tree; FSCTree; TFBS; approximate protein-DNA binding cores; association rule mining; bioinformatics; biological mutation; compact intermediate structure; frequent sequence class tree; frequent sequence tree; high-resolution binding cores; information content; low-resolution binding sequences; position weight matrix; protein-DNA association rules; transcription factor binding sites; Approximation methods; Association rules; Databases; Proteins; Pulse width modulation;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Engineering (ICDE), 2012 IEEE 28th International Conference on
Conference_Location :
Washington, DC
ISSN :
1063-6382
Print_ISBN :
978-1-4673-0042-1
Type :
conf
DOI :
10.1109/ICDE.2012.86
Filename :
6228148