• DocumentCode
    26836
  • Title

    Heterogeneous Compression of Large Collections of Evolutionary Trees

  • Author

    Matthews, Suzanne J.

  • Author_Institution
    Dept. of Electr. Eng. & Comput. Sci., United States Mil. Acad., West Point, NY, USA
  • Volume
    12
  • Issue
    4
  • fYear
    2015
  • fDate
    July-Aug. 1 2015
  • Firstpage
    807
  • Lastpage
    814
  • Abstract
    Compressing heterogeneous collections of trees is an open problem in computational phylogenetics. In a heterogeneous tree collection, each tree can contain a unique set of taxa. An ideal compression method would allow for the efficient archival of large tree collections and enable scientists to identify common evolutionary relationships over disparate analyses. In this paper, we extend TreeZip to compress heterogeneous collections of trees. TreeZip is the most efficient algorithm for compressing homogeneous tree collections. To the best of our knowledge, no other domain-based compression algorithm exists for large heterogeneous tree collections or enables their rapid analysis. Our experimental results indicate that TreeZip averages 89.03 percent (72.69 percent) space savings on unweighted (weighted) collections of trees when the level of heterogeneity in a collection is moderate. The organization of the TreeZip compressed (TRZ) file allows for efficient computations over heterogeneous data. For example, consensus trees can be computed in mere seconds. Lastly, combining the TRZ file with general-purpose compression yields average space savings of 97.34 percent (81.43 percent) on unweighted (weighted) collections of trees. Our results lead us to believe that TreeZip will prove invaluable in the efficient archival of tree collections and will enable scientists to develop novel methods for relating heterogeneous collections of trees.
  • Keywords
    biology computing; evolution (biological); genetic algorithms; genetics; trees (mathematics); TRZ file; TreeZip compressed file; average space savings; common evolutionary relationships; computational phylogenetics; domain-based compression algorithm; general-purpose compression; heterogeneous collection compression; heterogeneous data; heterogeneous tree collection; large evolutionary trees collections; Bioinformatics; Compression algorithms; Computational biology; IEEE transactions; Phylogeny; Special issues and sections; Collections; Compression; Heterogeneity; Heterogeneous; TreeZip; Trees
  • fLanguage
    English
  • Journal_Title
    Computational Biology and Bioinformatics, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    1545-5963
  • Type

    jour

  • DOI
    10.1109/TCBB.2014.2366756
  • Filename
    6945876