Title :
Detecting homogeneity in protein sequence clusters for automatic functional annotation and noise detection
Author_Institution :
Graduate Sch. of Biotechnol. & Bioinformatics, Yuan Ze Univ., Chung-Li, Taiwan
Abstract :
Protein sequence clustering aims to identify sets of homologous proteins in a protein database (Kriventseva et al., 2001). The information derived from clustering is widely used for further analysis such as protein family discovery, function prediction, and database compression. For some applications, it is highly desirable to generate a hierarchical structure, also referred to as a dendrogram, that shows how proteins are clustered at various levels. However, deciding the boundaries of the natural clusters that correspond to protein families is not an easy task. According to our previous studies, the weighted average precision of the homogeneous clusters in the hierarchy of the Swiss-Prot database, release 41.0, is 98.5% (Chen et al., 2004). Our experimental results show that 2158 protein families achieve their best matching rate on a homogeneous cluster, the largest of which contains 293 proteins. This result shows that many protein families possess the homogeneity property in their sequences. These 2158 best-matched clusters deliver a weighted average precision of 97.34% and a weighted average recall of 91.41%.
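The weighted average precision and recall reported above can be illustrated with a minimal sketch. The abstract does not state the weighting scheme, so this example assumes that precision is weighted by cluster size and recall by family size. The function name and the toy protein identifiers are hypothetical and do not come from the paper.

```python
def weighted_precision_recall(pairs):
    """Compute size-weighted average precision and recall.

    pairs: list of (cluster, family) tuples, each a set of protein IDs,
    where `cluster` is the best-matching cluster for `family`.
    ASSUMPTION: precision is weighted by cluster size and recall by
    family size; the source abstract does not define the weights.
    """
    # Per-pair precision is |C & F| / |C|. Weighting by |C| and dividing
    # by the total cluster size reduces to total overlap / total cluster size.
    overlap = sum(len(cluster & family) for cluster, family in pairs)
    cluster_total = sum(len(cluster) for cluster, _ in pairs)
    family_total = sum(len(family) for _, family in pairs)
    return overlap / cluster_total, overlap / family_total


# Toy data (hypothetical protein IDs, not taken from Swiss-Prot).
toy_pairs = [
    ({"a", "b", "c", "d"}, {"a", "b", "c", "e"}),  # overlap of 3
    ({"x", "y"}, {"x", "y", "z"}),                 # overlap of 2
]
precision, recall = weighted_precision_recall(toy_pairs)
print(f"precision={precision:.4f} recall={recall:.4f}")
```

Under this assumed weighting, larger clusters and families contribute proportionally more to the averages, so a few small, poorly matched clusters would have little effect on the overall figures.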
Keywords :
biology computing; pattern clustering; proteins; sequences; automatic functional annotation; automatic noise detection; database compression; dendrogram; function prediction; homologous proteins; protein database; protein family discovery; protein sequence clustering; Bioinformatics; Biotechnology; Clustering algorithms; Educational institutions; Gaussian distribution; Information analysis; Protein sequence; Spatial databases; Statistical distributions; Testing;
Conference_Title :
Emerging Information Technology Conference, 2005.
Print_ISBN :
0-7803-9328-7
DOI :
10.1109/EITC.2005.1544342