Title :
Random Forests for Prediction of DNA-Binding Residues in Protein Sequences Using Evolutionary Information
Author :
Wang, Liangjiang
Author_Institution :
Dept. of Genetics & Biochem., Clemson Univ., Clemson, SC, USA
Abstract :
This study presents a new machine learning approach for sequence-based prediction of DNA-binding residues in proteins. The approach uses both labeled data instances collected from the available structures of protein-DNA complexes and the abundant unlabeled data found in protein sequence databases. The evolutionary information contained in the unlabeled sequence data was represented as position-specific scoring matrices (PSSMs) and several new descriptors. These sequence-derived features were used to train random forests (RFs), which can handle a large number of input variables and avoid model overfitting. The use of evolutionary information was found to significantly improve classifier performance. The RF classifier was further evaluated on a separate test dataset. The results suggest that the RF-based approach predicts DNA-binding residues more accurately than the methods of previous studies.
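The abstract does not say how position-specific scoring matrix rows are turned into classifier inputs. A common choice in residue-level prediction is a sliding window centred on each residue, and the sketch below illustrates that idea only. The function name `window_features`, the window size, and the zero padding at the sequence ends are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch: sliding-window encoding of a PSSM into per-residue
# feature vectors. Window size, zero padding, and all names are assumptions.

def window_features(pssm, window=11):
    """Return one flat feature vector per residue.

    pssm   : list of rows, one per residue, each a list of scores
             (20 columns in a real PSSM)
    window : odd number of residues centred on the target residue
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    n_cols = len(pssm[0]) if pssm else 0
    pad = [0.0] * n_cols  # stands in for positions beyond either terminus
    features = []
    for i in range(len(pssm)):
        vec = []
        for j in range(i - half, i + half + 1):
            row = pssm[j] if 0 <= j < len(pssm) else pad
            vec.extend(row)
        features.append(vec)
    return features


# Toy PSSM: 5 residues x 3 columns, purely illustrative values
toy_pssm = [[1, 0, -1], [2, 1, 0], [0, 0, 0], [-1, 3, 1], [4, -2, 2]]
X = window_features(toy_pssm, window=3)
```

Each row of `X` could then be paired with a binding or non-binding label and passed to a random forest implementation, for example scikit-learn's `RandomForestClassifier`. That pairing is likewise an assumption about a typical workflow, not the study's documented pipeline.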
Keywords :
DNA; biochemistry; biology computing; classification; encoding; evolutionary computation; learning (artificial intelligence); molecular biophysics; proteins; random processes; DNA-binding residues; biochemical features; classifier construction; evolutionary information; input encoding; machine learning approach; position-specific scoring matrices; protein sequences; random forest learning algorithm; separate test dataset; Amino acids; Artificial neural networks; Encoding; Input variables; Machine learning; Proteins; Sequences; Support vector machine classification; Support vector machines; DNA-binding site prediction; feature extraction; random forests; semi-supervised learning
Conference_Title :
Future Generation Communication and Networking, 2008. FGCN '08. Second International Conference on
Conference_Location :
Hainan Island
Print_ISBN :
978-0-7695-3431-2
DOI :
10.1109/FGCN.2008.92