A machine learning approach to identify DNA replication proteins from sequence-derived features

Author

Runtao Yang ; Chengjin Zhang ; Rui Gao ; Lina Zhang

Author_Institution

Sch. of Control Sci. & Eng., Shandong Univ., Jinan, China

fYear

2015

fDate

3-6 May 2015

Firstpage

13

Lastpage

18

Abstract

DNA replication, a critical step in cell division and proliferation, is the process of producing two identical replicas from one original DNA molecule. Although great advances have been made in DNA replication research, its detailed mechanism is still unresolved. Faithful DNA replication requires the cooperation of many proteins, and failures in replication leave mutations in the genome that can cause cancers and other diseases. Accurately identifying these DNA replication proteins may therefore assist both in understanding the molecular mechanisms of DNA replication and in drug development. Because experimental methods are expensive and labor intensive, an accurate computational method for identifying DNA replication proteins is highly desirable. In this paper, a machine learning approach to identify DNA replication proteins is developed using a Naïve Bayes classifier and sequence-derived features. The prediction performance of features extracted from the Reduced Amino Acid Composition (RAAC) model and from each of two Pseudo Amino Acid Composition (PseAAC) models is investigated. Prediction results indicate that the PseAAC (type 2) model yields the best performance. Using the PseAAC (type 2) model, we then compare our method with the similarity search method on the independent test dataset. The comparison results show that it is feasible to identify DNA replication proteins with machine learning algorithms. The proposed method may provide candidate DNA replication proteins for future experimental verification, supporting both the study of DNA replication mechanisms and drug development for the treatment of human diseases.
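
The general idea in the abstract (composition features fed to a Naïve Bayes classifier) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the six-group reduced alphabet, the Gaussian form of the classifier, the variance floor, and the toy sequences, which are synthetic strings rather than real proteins. The PseAAC models are not reproduced.

```python
import math

# Hypothetical reduced alphabet: six physicochemical groups covering the
# 20 standard amino acids. The paper's actual grouping is not given here.
GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]
RESIDUE_TO_GROUP = {aa: i for i, members in enumerate(GROUPS) for aa in members}


def raac(sequence):
    """Return the fraction of residues that fall into each reduced group."""
    counts = [0] * len(GROUPS)
    for aa in sequence.upper():
        idx = RESIDUE_TO_GROUP.get(aa)
        if idx is not None:  # skip non-standard symbols such as X
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(GROUPS)


class TinyGaussianNB:
    """Minimal Gaussian Naive Bayes; an illustrative stand-in classifier."""

    def fit(self, X, y):
        self.stats = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            cols = list(zip(*rows))
            means = [sum(col) / len(rows) for col in cols]
            # Variance floor (assumed value) avoids division by zero.
            variances = [
                sum((v - m) ** 2 for v in col) / len(rows) + 1e-3
                for col, m in zip(cols, means)
            ]
            log_prior = math.log(len(rows) / len(y))
            self.stats[label] = (log_prior, means, variances)
        return self

    def predict(self, x):
        def log_posterior(label):
            log_prior, means, variances = self.stats[label]
            return log_prior + sum(
                -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
                for xi, m, v in zip(x, means, variances)
            )

        return max(self.stats, key=log_posterior)


# Synthetic toy sequences (not real proteins): label 1 is rich in charged
# residues, label 0 is rich in aliphatic residues.
train = [
    ("KRKRKDEKRA", 1),
    ("RKKDERKKGL", 1),
    ("AVLIMAVLGP", 0),
    ("LLIVAMFWGA", 0),
]
model = TinyGaussianNB().fit([raac(s) for s, _ in train], [t for _, t in train])
prediction = model.predict(raac("KKRRDEKKAV"))
```

A reduced alphabet keeps the feature vector short (six values here instead of twenty), which suits a classifier that estimates per-feature statistics from a small training set.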

Keywords

Bayes methods; DNA; biology computing; drugs; genetics; learning (artificial intelligence); proteins; DNA molecule; DNA replication protein; cell division; drug development; genome; machine learning; molecular mechanism; naive Bayes classifier; pseudo amino acid composition; reduced amino acid composition; sequence-derived feature; Accuracy; Amino acids; DNA; Diseases; Feature extraction; Proteins; Sensitivity;

fLanguage

English

Publisher

IEEE

Conference_Titel

Electrical and Computer Engineering (CCECE), 2015 IEEE 28th Canadian Conference on

Conference_Location

Halifax, NS

ISSN

0840-7789

Print_ISBN

978-1-4799-5827-6

Type

conf

DOI

10.1109/CCECE.2015.7129092

Filename

7129092