Title :
Assessing protein function using a combination of supervised and unsupervised learning
Author :
Yang, Jack Y. ; Yang, Mary Qu
Author_Institution :
Dept. of Radiat. Oncology, Harvard Univ., Boston, MA
Abstract :
The determination of protein function using experimental techniques is time-consuming and expensive; the use of machine learning techniques to rapidly assess protein function may be useful in streamlining this process. The problem of assigning functional classes to proteins is complicated by the fact that a single protein can participate in several different pathways and thus can have multiple functions. It follows that the instances in the resulting classification problem can carry multiple class labels. We have developed a tree-based classifier that is capable of handling multiply-labeled data; we call the resulting tree a recursive maximum-contrast tree (RMCT). The name derives from the way in which nodes in the tree are split: the two training instances with maximum contrast (that is, the two training instances with maximum separation according to some distance measure) are selected and used as seeds in a clustering algorithm to form a partition of the training instances and hence of the feature space. We test our algorithm on protein phylogenetic profiles generated from 60 completely sequenced genomes, and we compare our results to those achieved using existing algorithms such as support vector machines and decision trees.
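The node-splitting step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the distance measure or the clustering algorithm, so Euclidean distance and a two-cluster k-means are assumptions made here for illustration only; the function names `max_contrast_pair` and `split_node` and the toy profiles are hypothetical, and the handling of multiple class labels is not shown.

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+); an assumed measure


def max_contrast_pair(points):
    """Return indices (i, j) of the two instances that are farthest apart."""
    return max(combinations(range(len(points)), 2),
               key=lambda ij: dist(points[ij[0]], points[ij[1]]))


def split_node(points, n_iter=10):
    """Partition instances into two groups seeded by the max-contrast pair.

    Assumption: 2-means clustering stands in for the unspecified
    clustering algorithm. Returns two lists of instance indices.
    """
    i, j = max_contrast_pair(points)
    centers = [list(points[i]), list(points[j])]  # the two seeds
    assign = []
    for _ in range(n_iter):
        # Assign each instance to the nearer of the two centers.
        new_assign = [0 if dist(p, centers[0]) <= dist(p, centers[1]) else 1
                      for p in points]
        if new_assign == assign:  # partition is stable, stop early
            break
        assign = new_assign
        # Move each center to the mean of its current members.
        for k in (0, 1):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    left = [idx for idx, a in enumerate(assign) if a == 0]
    right = [idx for idx, a in enumerate(assign) if a == 1]
    return left, right


# Toy binary "phylogenetic profiles" (presence/absence across 4 genomes).
profiles = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1), (1, 1, 0, 1)]
left, right = split_node(profiles)
```

Applying `split_node` recursively to each resulting group would grow a tree; a stopping rule and the labeling of leaves are left out because the abstract does not specify them.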
Keywords :
biological techniques; biology computing; data handling; decision trees; genetics; learning (artificial intelligence); molecular biophysics; pattern classification; pattern clustering; proteins; recursive functions; support vector machines; unsupervised learning; clustering algorithm; feature space decomposition; machine learning techniques; multiple class labeled data handling; protein function assessment; protein phylogenetic profiles; recursive maximum-contrast tree algorithm; sequenced genomes; supervised learning; training instance partition; tree-based classifier; Bioinformatics; Classification tree analysis; Clustering algorithms; Genomics; Machine learning; Partitioning algorithms; Phylogeny; Proteins; Testing; Unsupervised learning
Conference_Title :
BioInformatics and BioEngineering, 2006. BIBE 2006. Sixth IEEE Symposium on
Conference_Location :
Arlington, VA
Print_ISBN :
0-7695-2727-2
DOI :
10.1109/BIBE.2006.253313