Title :
Predict Subcellular Locations of Singleplex and Multiplex Proteins by Semi-Supervised Learning and Dimension-Reducing General Mode of Chou's PseAAC
Author :
Pacharawongsakda, Eakasit ; Theeramunkong, Thanaruk
Author_Institution :
Sch. of Inf., Comput., & Commun. Technol., Thammasat Univ., Muang, Thailand
Abstract :
Predicting protein subcellular location is one of the major challenges in bioinformatics, since such knowledge helps us understand protein functions and enables us to select target proteins during the drug discovery process. While many computational techniques have been proposed to improve predictive performance for protein subcellular location, they have several shortcomings. In this work, we propose a method to address three main issues in such techniques: (i) handling of multiplex proteins, which may exist in or move between multiple cellular compartments, (ii) handling of high dimensionality in the input and output spaces, and (iii) the requirement of sufficient labeled data for model training. To address these issues, this work presents a new computational method for predicting the locations of proteins that have either single or multiple locations. The proposed technique, named iFLAST-CORE, combines dimensionality reduction in the feature and label spaces with the co-training paradigm for semi-supervised multi-label classification. For this purpose, Singular Value Decomposition (SVD) is applied to transform the high-dimensional feature space and label space into lower-dimensional spaces. After that, because labeled data are limited, co-training regression makes use of unlabeled data by predicting their target values in the lower-dimensional space. In the last step, the SVD components are used to project labels in the lower-dimensional space back to the original space, and an adaptive threshold is used to map each numeric value to a binary value for label determination. A set of experiments on viral proteins and gram-negative bacterial proteins shows that the proposed method improves classification performance in terms of various evaluation metrics, such as Aiming (or Precision), Coverage (or Recall) and macro F-measure, compared to the traditional method that uses only labeled data.
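The reduce, regress, back-project and threshold steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' iFLAST-CORE implementation: the co-training stage on unlabeled data is omitted, a closed-form ridge regressor stands in for the co-training regressors, and the "largest gap" thresholding rule is only one possible adaptive threshold, assumed here because the abstract does not specify one. All function names (svd_basis, fit_ridge, adaptive_threshold, fit_predict) are hypothetical.

```python
import numpy as np


def svd_basis(M, k):
    """Top-k right singular vectors of M, returned as a (columns x k) matrix."""
    _, _, vt = np.linalg.svd(M, full_matrices=False)
    return vt[:k].T


def fit_ridge(X, Z, alpha=1e-3):
    """Closed-form ridge regression from X to Z.

    Assumption: a simple stand-in for the co-training regressors.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Z)


def adaptive_threshold(scores):
    """Binarize each row by cutting at the largest gap between sorted scores.

    Assumption: an illustrative per-sample rule, not the one used in the paper.
    """
    out = np.zeros_like(scores, dtype=int)
    for i, row in enumerate(scores):
        s = np.sort(row)
        j = np.argmax(np.diff(s))          # position of the largest gap
        cut = (s[j] + s[j + 1]) / 2.0      # midpoint of that gap
        out[i] = (row > cut).astype(int)
    return out


def fit_predict(X_lab, Y_lab, X_new, k_feat, k_lab):
    """Reduce features and labels with SVD, regress, back-project, threshold."""
    Vx = svd_basis(X_lab, k_feat)          # feature-space projection
    Vy = svd_basis(Y_lab, k_lab)           # label-space projection
    W = fit_ridge(X_lab @ Vx, Y_lab @ Vy)  # regress reduced labels on reduced features
    scores = (X_new @ Vx) @ W @ Vy.T       # back-project to the original label space
    return adaptive_threshold(scores)
```

In this sketch a multiplex protein is simply a row of the label matrix with more than one entry set to 1; the largest-gap rule always assigns at least one location per protein whenever the back-projected scores are not all equal.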
Keywords :
bioinformatics; cellular biophysics; feature extraction; learning (artificial intelligence); microorganisms; molecular biophysics; molecular configurations; pattern classification; proteins; regression analysis; singular value decomposition; transforms; Aiming; Chou PseAAC; Coverage; SVD; adaptive threshold; bioinformatics; computational techniques; cotraining paradigm; cotraining regression; dimension-reducing general mode; dimensionality reduction; drug discovery process; gram-negative bacterial proteins; high-dimensional feature space transform; iFLAST-CORE; label spaces; macro F-measure; multiple cellular compartments; multiplex proteins; predict subcellular locations; predictive performance; protein functions; protein subcellular location; semisupervised learning; semisupervised multilabel classification; singleplex proteins; singular value decomposition; sufficient labeled data; viral proteins; Bioinformatics; Proteins; Semisupervised learning; Singular value decomposition; Support vector machines; Training; Co-training; dimensionality reduction; gene ontology; multi-label classification; semi-supervised learning; subcellular location;
Journal_Title :
NanoBioscience, IEEE Transactions on
DOI :
10.1109/TNB.2013.2272014