  • DocumentCode
    19360
  • Title
    A Genomic Distance for Assembly Comparison Based on Compressed Maximal Exact Matches
  • Author
    Garcia, S.P.; Rodrigues, J.M.O.S.; Santos, Sara; Pratas, Diogo; Afreixo, V.; Bastos, C.A.C.; Ferreira, P.J.S.G.; Pinho, Armando J.
  • Author_Institution
    Signal Process. Lab., Univ. of Aveiro, Aveiro, Portugal
  • Volume
    10
  • Issue
    3
  • fYear
    2013
  • fDate
    May-June 2013
  • Firstpage
    793
  • Lastpage
    798
  • Abstract
    Genome assemblies are typically compared with respect to their contiguity, coverage, and accuracy. We propose a genome-wide, alignment-free genomic distance based on compressed maximal exact matches and suggest adding it to the benchmark of commonly used assembly quality metrics. Maximal exact matches are perfect repeats, without gaps or misspellings, which cannot be further extended to either their left- or right-end side without loss of similarity. The genomic distance proposed here is based on the normalized compression distance, an information-theoretic measure of the relative compressibility of two sequences estimated using multiple finite-context models. This measure exposes similarities between the sequences, as well as the nesting structure underlying the assembly of larger maximal exact matches from smaller ones. We use four human genome assemblies for illustration and discuss the impact of genome sequencing and assembly on the final content of maximal exact matches and on the genomic distance proposed here.
  • Keywords
    bioinformatics; genomics; alignment-free genomic distance; assembly quality metrics; compressed maximal exact matches; genome sequences; genome sequencing; human genome assemblies; information-theoretic measure; multiple finite-context models; nesting structure; normalized compression distance; perfect repeats; relative compressibility; Assembly; Computational biology; Materials; Sequential analysis; Genome sequencing and assembly; maximal exact matches
  • fLanguage
    English
  • Journal_Title
    Computational Biology and Bioinformatics, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    1545-5963
  • Type
    jour
  • DOI
    10.1109/TCBB.2013.77
  • Filename
    6552202