  • DocumentCode
    397263
  • Title
    A block coding method that leads to significantly lower entropy values for the proteins and coding sections of Haemophilus influenzae
  • Author
    Sampath, G.
  • Author_Institution
    Dept. of Comput. Sci., Coll. of New Jersey, Ewing, NJ, USA
  • fYear
    2003
  • fDate
    11-14 Aug. 2003
  • Firstpage
    287
  • Lastpage
    293
  • Abstract
    A simple statistical block code in combination with the LZW-based compression utilities gzip and compress has been found to increase by a significant amount the level of compression possible for the proteins encoded in Haemophilus influenzae, the first fully sequenced genome. The method yields an entropy value of 3.665 bits per symbol (bps), which is 0.657 bps below the maximum of 4.322 bps and an improvement of 0.452 bps over the best known to date of 4.118 bps using Matsumoto, Sadakane, and Imai's Iza-CTW algorithm. Calculations based on a compact inverse genetic code show that the genome has a maximum entropy of 1.757 bps for the coding regions, with a possibly lower actual entropy. These results hint at the existence of hitherto unexplored redundancies that do not show up in Markov models and are indicative of more internal structure than suspected in both the protein and the genome.
  • Keywords
    biology computing; block codes; data compression; encoding; entropy; genetic algorithms; genetics; hidden Markov models; physiological models; proteins; Haemophilus influenzae coding sections; Imai Iza-CTW algorithm; LZW-based compression; Markov models; Matsumoto algorithm; Sadakane algorithm; block coding method; compact inverse genetic code; entropy values; gzip; proteins; sequenced genome; statistical block code; Bioinformatics; Biological information theory; Biology computing; Block codes; Chemicals; Data mining; Distributed computing; Entropy; Genomics; Proteins;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Bioinformatics Conference, 2003. CSB 2003. Proceedings of the 2003 IEEE
  • Print_ISBN
    0-7695-2000-6
  • Type
    conf
  • DOI
    10.1109/CSB.2003.1227329
  • Filename
    1227329
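
The figures in the abstract can be sanity-checked with a short sketch. The 4.322 bps maximum appears to correspond to log2(20), the entropy of a uniform distribution over the 20 standard amino acids. The helper below also shows one common way to turn a compressor's output size into a bits-per-symbol estimate. It is an illustration only, not the block code described in the record: the function names are hypothetical, and Python's zlib (DEFLATE) stands in for the gzip utility as an assumption.

```python
import math
import zlib


def max_entropy_bits(alphabet_size: int) -> float:
    """Entropy in bits per symbol of a uniform source over the alphabet."""
    return math.log2(alphabet_size)


def compressed_bits_per_symbol(sequence: str) -> float:
    """Rough upper-bound entropy estimate: compressed bits per input symbol.

    Assumption: one ASCII character per symbol, compressed with zlib
    (DEFLATE) at maximum level as a stand-in for gzip.
    """
    data = sequence.encode("ascii")
    compressed = zlib.compress(data, 9)
    return 8 * len(compressed) / len(data)


if __name__ == "__main__":
    # Uniform distribution over 20 amino acids
    print(round(max_entropy_bits(20), 3))   # 4.322
    # Gap between that maximum and the reported 3.665 bps
    print(round(4.322 - 3.665, 3))          # 0.657
```

Calling `compressed_bits_per_symbol` on a concatenated protein sequence gives a baseline to compare against the 3.665 bps reported in the abstract; any preprocessing such as the statistical block code would have to be applied before this step.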