• DocumentCode
    3061562
  • Title

    A general-purpose compression scheme for databases

  • Author

    Cannane, Adam ; Williams, Hugh E. ; Zobel, Justin

  • Author_Institution
    Dept. of Comput. Sci., R. Melbourne Inst. of Technol. Univ., Vic., Australia
  • fYear
    1999
  • fDate
    29-31 Mar 1999
  • Firstpage
    519
  • Abstract
    Summary form only given. Current adaptive compression schemes such as GZIP and COMPRESS are impractical for database compression as they do not allow random access to individual records. A compression algorithm for general-purpose database systems must address the problem of randomly accessing and individually decompressing records, while maintaining compact storage of data. The SEQUITUR algorithm of Nevill-Manning et al., (1994, 1996, 1997) also adaptively compresses data, achieving excellent compression but with significant main-memory requirements. A preliminary version of SEQUITUR used a semi-static modelling approach to achieve slightly worse compression than the adaptive approach. We describe a new variant of the semi-static SEQUITUR algorithm, RAY, that reduces main-memory use and allows random-access to databases. RAY models repetition in sequences by progressively constructing a hierarchical grammar with multiple passes through the data. The multiple pass approach of RAY uses statistics on character pair repetition, or digram frequency, to create rules in the grammar. While our preliminary implementation is not especially fast, the multi-pass approach permits reductions in compression time, at the cost of affecting compression performance, by limiting the number of passes. We have found that RAY has practicable main-memory requirements and achieves better compression than an efficient Huffmann scheme and popular adaptive compression techniques. Moreover, our scheme allows random access to data and is not restricted to databases of text
  • Keywords
    data compression; database management systems; information retrieval; sequences; RAY; SEQUITUR algorithm; adaptive compression; character pair repetition; data compression; database compression; digram frequency; hierarchical grammar; main-memory requirements; multiple pass approach; random access; sequence repetition; Compression algorithms; Computer science; Costs; Data compression; Database systems; Frequency; IEEE Computer Society Press; Statistics;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Data Compression Conference, 1999. Proceedings. DCC '99
  • Conference_Location
    Snowbird, UT
  • ISSN
    1068-0314
  • Print_ISBN
    0-7695-0096-X
  • Type

    conf

  • DOI
    10.1109/DCC.1999.785676
  • Filename
    785676