• DocumentCode
    2188553
  • Title

    A New Searchable Variable-to-Variable Compressor

  • Author

    Brisaboa, Nieves R. ; Fari, Antonio ; Lopez, Juan R ; Navarro, Gonzalo ; Lopez, Eduardo R

  • Author_Institution
    Database Lab., Univ. of A Coruna, A Coruna, Spain
  • fYear
    2010
  • fDate
    24-26 March 2010
  • Firstpage
    199
  • Lastpage
    208
  • Abstract
    Word-based compression over natural language text has shown to be a good choice to trade compression ratio and speed, obtaining compression ratios close to 30% and very fast decompression. Additionally, it permits fast searches over the compressed text using Boyer-Moore type algorithms. Such compressors are based on processing fixed source symbols (words) and assigning them variable-byte-length codewords, thus following a fixed-to-variable approach. We present a new variable-to-variable compressor (v2vdc) that uses words and phrases as the source symbols, which are encoded with a variable-length scheme. The phrases are chosen using the longest common prefix information on the suffix array of the text, so as to favor long and frequent phrases. We obtain compression ratios close to those of p7zip and ppmdi, overcoming bzip2, and 8-10 percentage points less than the equivalent word-based compressor. V2vdc is in addition among the fastest to decompress, and allows efficient direct search of the compressed text, in some cases the fastest to date as well.
  • Keywords
    data compression; natural language processing; search problems; symbol manipulation; text analysis; variable rate codes; word processing; Boyer-Moore type algorithm; compression ratio; compression speed; fixed-to-variable approach; natural language text; source symbols; text search; variable-byte-length codewords; variable-to-variable compressor; word based compression; word processing; Computer science; Data compression; Decoding; Encoding; Frequency; Gain measurement; Natural languages; Spatial databases; natural language; searchable compressed text; variable-to-variable compression; word-based compression;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Data Compression Conference (DCC), 2010
  • Conference_Location
    Snowbird, UT
  • ISSN
    1068-0314
  • Print_ISBN
    978-1-4244-6425-8
  • Electronic_ISBN
    1068-0314
  • Type

    conf

  • DOI
    10.1109/DCC.2010.25
  • Filename
    5453462