• DocumentCode
    2708791
  • Title

    A flexible compressed text retrieval system using a modified LZW algorithm

  • Author

    Zhang, Nan ; Tao, Tao ; Satya, Ravi Vijaya ; Mukherjee, Amar

  • Author_Institution
    Sch. of Comput. Sci., Central Florida Univ., Orlando, FL, USA
  • fYear
    2005
  • fDate
    29-31 March 2005
  • Firstpage
    493
  • Abstract
    Summary form only given. With an increasing amount of text data being stored in compressed format, being able to access the compressed data randomly and decode it partially is highly desirable for efficient retrieval in many applications. The efficiency of these operations depends on the compression method used. We present a modified LZW algorithm that supports efficient indexing and searching on compressed files. Our method performs in a sublinear complexity, since we only decode a small portion of the file. The proposed approach not only provides the flexibility for dynamic indexing in different text granularities, but also provides the possibility for parallel processing in both encoding and decoding sides, independent of the number of processors available. It also provides good error resilience. The compression ratio is improved using the proposed modified LZW algorithm. Test results show that our public trie method has a compression ratio of 0.34 for the TREC corpus and 0.32 with text preprocessing using a star transform with an optimal static dictionary; this is very close to the efficient word Huffman and phrase based word Huffman schemes, but has a more flexible random access ability.
  • Keywords
    Huffman codes; computational complexity; data compression; decoding; parallel processing; pattern classification; query formulation; text analysis; compressed file indexing; compressed file searching; compression ratio; data compression; flexible compressed text retrieval system; modified LZW algorithm; parallel processing; partial decoding; random access ability; star transform; static dictionary; sublinear complexity; text preprocessing; word Huffman schemes; Computer errors; Computer science; Decoding; Dictionaries; Encoding; Indexing; Information retrieval; Libraries; Parallel processing; Testing;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Data Compression Conference, 2005. Proceedings. DCC 2005
  • ISSN
    1068-0314
  • Print_ISBN
    0-7695-2309-9
  • Type

    conf

  • DOI
    10.1109/DCC.2005.5
  • Filename
    1402250