DocumentCode :
3384588
Title :
A dictionary-based multi-corpora text compression system
Author :
Sun, Weifeng ; Zhang, Nan ; Mukherjee, Amar
Author_Institution :
Dept. of Comput. Sci., Central Florida Univ., Orlando, FL, USA
fYear :
2003
fDate :
25-27 March 2003
Firstpage :
448
Abstract :
Summary form only given. StarZip, a multi-corpora text compression system, was introduced together with its transform engine, StarNT. A key feature of StarZip is its use of domain-specific dictionaries, together with tools for developing such dictionaries. StarNT was used because it achieves a better compression ratio than almost all other recent efforts based on BWT and PPM. StarNT is a fast, dictionary-based, lossless text transform. The main idea is to represent each English word with a codeword of no more than three symbols. The transform preserves most of the original context information at the word level and provides an "artificial" strong context. It ultimately reduces the size of the transformed text, which is then passed to a backend compressor. The data structure that holds the transform dictionary provides very fast transform encoding with low storage overhead. StarNT also treats each transformed codeword as the offset of a word in the transform dictionary, so the transform decoder can locate a word by its offset rather than by searching the dictionary. Experimental results showed that the average compression time improved by orders of magnitude compared with LIPT, an earlier dictionary-based transform. The complexity and compression performance of bzip2 in conjunction with this transform are better than those of both gzip and PPMD. Results from five corpora showed that StarZip achieved an average improvement in compression performance (in terms of BPC) of 13% over bzip2-9, 19% over gzip-9, and 10% over PPMD.
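The codeword-as-offset idea in the abstract can be illustrated with a minimal sketch. This is not the StarNT implementation: the 52-letter codeword alphabet, the `*` escape for out-of-dictionary words, the whitespace tokenization, and all function names are assumptions made for illustration only. A plain Python dict stands in for whatever encoder-side structure StarNT actually uses, which this record does not name.

```python
import string

# Assumed codeword alphabet: 52 ASCII letters (not taken from the source).
ALPHABET = string.ascii_letters
BASE = len(ALPHABET)
# Number of distinct codewords with 1, 2, or 3 symbols.
CAPACITY = BASE + BASE**2 + BASE**3
# Hypothetical marker for words that are missing from the dictionary.
ESCAPE = "*"


def index_to_codeword(index):
    """Map a dictionary offset to a codeword of one to three symbols."""
    if not 0 <= index < CAPACITY:
        raise ValueError("offset does not fit in three symbols")
    length, block = 1, BASE
    # Skip past the blocks of shorter codewords.
    while index >= block:
        index -= block
        length += 1
        block *= BASE
    symbols = []
    for _ in range(length):
        index, digit = divmod(index, BASE)
        symbols.append(ALPHABET[digit])
    return "".join(reversed(symbols))


def codeword_to_index(codeword):
    """Invert index_to_codeword by arithmetic, without a dictionary search."""
    # Offset of the first codeword that has this length.
    offset = sum(BASE**k for k in range(1, len(codeword)))
    value = 0
    for symbol in codeword:
        value = value * BASE + ALPHABET.index(symbol)
    return offset + value


def encode(text, dictionary):
    """Replace each known word by its codeword; escape unknown words."""
    lookup = {word: i for i, word in enumerate(dictionary)}
    out = []
    for word in text.split():
        if word in lookup:
            out.append(index_to_codeword(lookup[word]))
        else:
            out.append(ESCAPE + word)
    return " ".join(out)


def decode(transformed, dictionary):
    """Recover words by indexing the dictionary list with each offset."""
    out = []
    for token in transformed.split():
        if token.startswith(ESCAPE):
            out.append(token[len(ESCAPE):])
        else:
            out.append(dictionary[codeword_to_index(token)])
    return " ".join(out)


# Toy dictionary, ordered so that frequent words get the shortest codewords.
toy_dictionary = ["the", "of", "and", "compression"]
transformed = encode("the art of compression", toy_dictionary)
restored = decode(transformed, toy_dictionary)
```

In this sketch the decoder never searches for a word: it converts the codeword to an integer and indexes the list directly, which is the property the abstract attributes to the transform decoder. The transformed text would then be handed to a backend compressor such as bzip2.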
Keywords :
data compression; data structures; dictionaries; text analysis; transform coding; BWT; English word representation; LIPT transform; PPM; PPMD; StarNT transform engine; StarZip text compression system; artificial context; backend compressor; bzip2; codewords; compression ratio; compression time; context information; dictionary-based multi-corpora text compression system; gzip; lossless text transform; storage overhead; time complexity; transform decoder; transform dictionary; transform encoding; word offset; Computer science; Data compression; Data structures; Decoding; Dictionaries; Encoding; Engines; Frequency; Sun;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Compression Conference, 2003. Proceedings. DCC 2003
ISSN :
1068-0314
Print_ISBN :
0-7695-1896-6
Type :
conf
DOI :
10.1109/DCC.2003.1194067
Filename :
1194067