• DocumentCode
    2945186
  • Title

    Sequence Similarity by Gapped LZW

  • Author

    Apostolico, Alberto ; Cunial, Fabio

  • Author_Institution
    Coll. of Comput., Georgia Inst. of Technol., Atlanta, GA, USA
  • fYear
    2011
  • fDate
    29-31 March 2011
  • Firstpage
    343
  • Lastpage
    352
  • Abstract
    Measures of sequence similarity based on some underlying notion of relative compressibility are becoming increasingly of interest in connection with massive tasks of textfile classification such as, notably, in document classification and molecular taxonomy on a genomic scale. Sequences that are similar can be expected to share a large number of common substrings, whence some successful measures in this class have been based on the substring composition of the input sequences. Among the corresponding methods, one finds suitable extensions of the bag-of-words together with more explicit resorts to data compression techniques such as LZ77. The approach presented in this paper explores the potential of LZW - the variant of LZ78 proposed by Welch -- as well as of some of its lossy variants, in this context. Whereas LZW has a faster and simpler implementation than LZ77, the vocabulary underlying LZW is significantly smaller than that of LZ77. In addition, recently introduced "gapped" variants of LZW are considered that are equally straightforward to implement but allow for a controlled number of don\´t cares to be introduced in the substrings that constitute the dictionary used in compression. This study assesses the robustness of compressibility based measures of similarity under these faster yet inherently more dispersive paradigms built around LZW.
  • Keywords
    data compression; encoding; LZ77; LZ78; Lempel-Ziv-Welchcompression; bag-of-words; document classification; gapped LZW; molecular taxonomy; relative compressibility; sequence similarity; textfile classification; Dictionaries; Encoding; Frequency measurement; Length measurement; Loss measurement; Vocabulary;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Data Compression Conference (DCC), 2011
  • Conference_Location
    Snowbird, UT
  • ISSN
    1068-0314
  • Print_ISBN
    978-1-61284-279-0
  • Type

    conf

  • DOI
    10.1109/DCC.2011.41
  • Filename
    5749492