A New Alignment-Independent Algorithm for Clustering Protein Sequences

Author

Kelil, Abdellali; Wang, Shengrui; Brzezinski, Ryszard

Author_Institution

Sherbrooke Univ., Sherbrooke

fYear

2007

fDate

14-17 Oct. 2007

Firstpage

27

Lastpage

34

Abstract

The rapid growth of available protein data makes clustering within protein families increasingly important; the challenge is to identify subfamilies of evolutionarily related sequences. This identification reveals phylogenetic relationships, which provide prior knowledge to help researchers understand biological phenomena. A good evolutionary model is essential to achieve a clustering that reflects biological reality, and an accurate estimate of protein sequence similarity is crucial to building such a model. Most existing algorithms estimate this similarity using techniques that are not necessarily biologically plausible, especially for hard-to-align sequences such as multi-domain, circular-permutation, and tandem-repeat protein sequences, which cause many difficulties for alignment-dependent algorithms. In this paper, we propose a novel similarity measure based on matching amino acid subsequences. This measure, named SMS for Substitution Matching Similarity, is designed specifically for non-aligned protein sequences. It allows us to develop a new alignment-independent algorithm, named CLUSS, for clustering protein families. To the best of our knowledge, this is the first alignment-free algorithm for clustering protein sequences. Unlike other clustering algorithms, CLUSS is effective on both alignable and non-alignable protein families.

Keywords

biology computing; molecular biophysics; proteins; alignment-independent algorithm; amino acid subsequences; clustering protein sequences; substitution matching similarity; Algorithm design and analysis; Amino acids; Biological system modeling; Biology computing; Clustering algorithms; Databases; Evolution (biology); Phylogeny; Protein engineering; Protein sequence; Biological function; Clustering; Non-alignable; Phylogeny; Protein sequences

fLanguage

English

Publisher

IEEE

Conference_Titel

Bioinformatics and Bioengineering, 2007. BIBE 2007. Proceedings of the 7th IEEE International Conference on

Conference_Location

Boston, MA

Print_ISBN

978-1-4244-1509-0

Type

conf

DOI

10.1109/BIBE.2007.4375541

Filename

4375541