• DocumentCode
    2945789
  • Title

    Block-Oriented Dense Compressor

  • Author

    Procházka, Petr ; Holub, Jan

  • Author_Institution
    Dept. of Theor. Comput. Sci., Czech Tech. Univ. in Prague, Prague, Czech Republic
  • fYear
    2011
  • fDate
    29-31 March 2011
  • Firstpage
    472
  • Lastpage
    472
  • Abstract
    The paper addresses the problem of block-oriented natural language compression. Adaptive and semi-adaptive compression methods are nowadays very common in the natural language compression field, each of them with different application possibilities. Block-oriented compression is semi-adaptive in terms of one block but adaptive in terms of the whole input. Our block-oriented compression method is based on the Dense Code idea. It achieves a very good compression ratio of around 32 % on natural language text and proved to be very fast in searching the compressed text. We show that our method has some interesting properties which could be applied to digital libraries. The compression method allows direct searching of the compressed text. Moreover, the vocabulary can be used as a block index, which makes some kinds of searching very fast. Another property is that the compressor can send single blocks with the corresponding vocabulary, which is considerate of limited bandwidth. In addition, the compressed file can be continuously extended without the need for previous decompression. Our block-oriented compression method is called Semi-adaptive Two Byte Dense Code (STBDC) and it is a semi-adaptive version of the proposed TBDC. The STBDC codeword is composed of one or two bytes. The values of the first byte are so-called stoppers or continuers. In the second byte any combination of bits is allowed, which is the reason for the limited coding space. The decomposition of the input text into blocks is based on the limit of the coding space. The end of a block must always come when the coding space given by the number of stoppers is exhausted. The changes between consecutive blocks are encoded in the dictionary file so that the original dictionary for the corresponding block can be easily and quickly reconstructed.
  • Keywords
    adaptive codes; data compression; natural language processing; text analysis; STBDC codeword; adaptive compression methods; block-oriented natural language compression; coding space; digital library; semiadaptive compression methods; semiadaptive two byte dense code; Adaptation model; Data compression; Dictionaries; Encoding; Natural languages; Real time systems; Vocabulary; Dense Code; Digital Libraries; Natural Language Compression;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Compression Conference (DCC), 2011
  • Conference_Location
    Snowbird, UT
  • ISSN
    1068-0314
  • Print_ISBN
    978-1-61284-279-0
  • Type

    conf

  • DOI
    10.1109/DCC.2011.76
  • Filename
    5749529
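
The codeword layout described in the Abstract (one or two bytes, a first byte that is either a stopper or a continuer, and an unrestricted second byte) can be illustrated with a minimal sketch. The record does not give the exact byte assignment, so everything below is an assumption made for illustration: first-byte values below `stoppers` are treated as complete one-byte codewords, the remaining first-byte values are treated as continuers, and words are identified by a frequency rank. The function names `capacity`, `encode_rank`, and `decode_stream` are hypothetical and are not taken from the paper.

```python
def capacity(stoppers):
    """Size of the coding space, assuming 256 - stoppers first-byte values are continuers."""
    continuers = 256 - stoppers
    # One-byte codewords plus two-byte codewords (any value allowed in the second byte).
    return stoppers + continuers * 256


def encode_rank(rank, stoppers):
    """Encode a word rank as one or two bytes under the assumed layout."""
    if not 0 <= rank < capacity(stoppers):
        # In a block-oriented scheme this is where a new block would have to start.
        raise ValueError("rank outside the coding space")
    if rank < stoppers:
        return bytes([rank])  # stopper: a complete one-byte codeword
    offset = rank - stoppers
    # Continuer in the first byte, unrestricted value in the second byte.
    return bytes([stoppers + offset // 256, offset % 256])


def decode_stream(data, stoppers):
    """Decode a byte string produced by concatenating encode_rank outputs."""
    ranks = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < stoppers:
            ranks.append(first)
            i += 1
        else:
            ranks.append(stoppers + (first - stoppers) * 256 + data[i + 1])
            i += 2
    return ranks
```

Under these assumptions, choosing 200 stoppers leaves 56 continuers and a coding space of 200 + 56 * 256 = 14536 codewords; a block whose vocabulary would exceed that size has to end, which is the block-splitting condition the Abstract refers to.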