DocumentCode :
3421292
Title :
On compressibility of protein sequences
Author :
Adjeroh, Donald ; Nan, Fei
Author_Institution :
Dept. of Comput. Sci. & Electr. Eng., West Virginia Univ., Morgantown, WV
fYear :
2006
fDate :
28-30 March 2006
Lastpage :
434
Abstract :
We consider the problem of compressibility of protein sequences. Based on an observed genome-scale long-range correlation in concatenated protein sequences from different organisms, we propose a method to exploit this unusual redundancy in compressing the protein sequences. The result is a significant reduction in the number of bits required for representing the sequences. We report results in bits per symbol (bps) of 2.27, 2.55, 3.11 and 3.44 for protein sequences from M. jannaschii, H. influenzae, S. cerevisiae, and H. sapiens respectively, the same protein sequences used by Nevill-Manning and Witten in the "Protein is incompressible" paper. The observed long-range correlations could have significant implications beyond compression and complexity analysis of protein sequences.
Keywords :
computational complexity; data compression; image coding; image sequences; proteins; complexity analysis; concatenated protein sequences; genome-scale long-range correlation; long-range correlations; protein sequence compression; Amino acids; Bioinformatics; Biological information theory; DNA; Genetics; Genomics; Organisms; Protein engineering; Protein sequence; RNA
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Compression Conference, 2006. DCC 2006. Proceedings
Conference_Location :
Snowbird, UT
ISSN :
1068-0314
Print_ISBN :
0-7695-2545-8
Type :
conf
DOI :
10.1109/DCC.2006.56
Filename :
1607277