Title :
A flexible compressed text retrieval system using a modified LZW algorithm
Author :
Zhang, Nan ; Tao, Tao ; Satya, Ravi Vijaya ; Mukherjee, Amar
Author_Institution :
Sch. of Comput. Sci., Central Florida Univ., Orlando, FL, USA
Abstract :
Summary form only given. With an increasing amount of text data being stored in compressed format, being able to access the compressed data randomly and decode it partially is highly desirable for efficient retrieval in many applications. The efficiency of these operations depends on the compression method used. We present a modified LZW algorithm that supports efficient indexing and searching on compressed files. Our method performs in a sublinear complexity, since we only decode a small portion of the file. The proposed approach not only provides the flexibility for dynamic indexing in different text granularities, but also provides the possibility for parallel processing in both encoding and decoding sides, independent of the number of processors available. It also provides good error resilience. The compression ratio is improved using the proposed modified LZW algorithm. Test results show that our public trie method has a compression ratio of 0.34 for the TREC corpus and 0.32 with text preprocessing using a star transform with an optimal static dictionary; this is very close to the efficient word Huffman and phrase based word Huffman schemes, but has a more flexible random access ability.
Keywords :
Huffman codes; computational complexity; data compression; decoding; parallel processing; pattern classification; query formulation; text analysis; compressed file indexing; compressed file searching; compression ratio; data compression; flexible compressed text retrieval system; modified LZW algorithm; parallel processing; partial decoding; random access ability; star transform; static dictionary; sublinear complexity; text preprocessing; word Huffman schemes; Computer errors; Computer science; Decoding; Dictionaries; Encoding; Indexing; Information retrieval; Libraries; Parallel processing; Testing;
Conference_Titel :
Data Compression Conference, 2005. Proceedings. DCC 2005
Print_ISBN :
0-7695-2309-9