Title :
Weighted amino acid composition based on amino acid indices for prediction of protein structural classes
Author :
Nanuwa, Sundeep Singh ; Dziurla, André ; Seker, Huseyin
Author_Institution :
Dept. of Inf., De Montfort Univ., Leicester, UK
Abstract :
Prediction of protein structural classes is one of the most important and challenging tasks in bioinformatics. A protein is classified into one of four main structural classes: all-α, all-β, α/β and α+β. This paper investigates the role of amino acid indices (AAI) combined with traditional amino acid composition (AAC) to create a weighted amino acid composition (WAAC) feature set for predicting the structural class of a protein. There are over 500 amino acid indices that can be used to develop the novel WAAC feature set, which has great potential to increase the accuracy of protein structural class prediction. To evaluate these indices, a high-quality 40% homology dataset is used that contains over 7000 protein sequences (the largest of its kind) extracted from proteomic databases. The predictive technique developed is an optimum k-nearest-neighbour classifier, named multiple-k-nearest-neighbour (MKNN). A 10-fold cross-validation test procedure is used throughout the study to evaluate the classifier. Over 1 million analyses were carried out; the highest accuracy, 48.35%, was obtained with index LEVM780101, which is 9% higher than traditional AAC and 6.6% higher than the best sequence-driven feature subset used in other studies. There is great potential for further improvement, as WAAC is a feature set with the fewest attributes and involves no feature selection. Of the 548 amino acid indices analysed in this study, 536 yielded higher accuracy than traditional AAC and 435 yielded higher accuracy than other sequence-driven features.
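The abstract does not give the exact WAAC formula, so the following is only a minimal sketch under one plausible assumption: each of the 20 AAC fractions is multiplied by the corresponding value of a single amino acid index. The function names and the `toy_index` values are hypothetical and chosen for illustration; they are not the published LEVM780101 values, and the sketch does not cover the MKNN classifier.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature vectors line up.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue in the sequence (20 values)."""
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def weighted_amino_acid_composition(sequence, index):
    """Assumed WAAC: scale each AAC fraction by the residue's index value."""
    aac = amino_acid_composition(sequence)
    return [fraction * index[a] for fraction, a in zip(aac, AMINO_ACIDS)]


# Hypothetical index values for illustration only (NOT a real amino acid index).
toy_index = {a: float(i + 1) for i, a in enumerate(AMINO_ACIDS)}

waac = weighted_amino_acid_composition("ACCA", toy_index)
```

Under this assumption the WAAC vector has the same 20 attributes as plain AAC, which matches the abstract's remark that WAAC is a feature set with the fewest attributes; a real experiment would substitute values from a published amino acid index for `toy_index`.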
Keywords :
bioinformatics; feature extraction; molecular biophysics; molecular configurations; pattern classification; proteins; proteomics; LEVM780101; amino acid indices; cross-validation test procedure; feature selection; homology dataset; multiple-k-nearest-neighbour; optimum k-nearest-neighbour classifier; protein sequences; protein structural class prediction; proteomic databases; weighted amino acid composition; Accuracy; Amino acids; Drugs; Informatics; Information technology; Protein engineering; Spatial databases; Testing; ASTRAL; Amino acid scales; pseudo amino acid composition
Conference_Title :
Information Technology and Applications in Biomedicine, 2009. ITAB 2009. 9th International Conference on
Conference_Location :
Larnaca
Print_ISBN :
978-1-4244-5379-5
Electronic_ISBN :
978-1-4244-5379-5
DOI :
10.1109/ITAB.2009.5394398