DocumentCode :
3012952
Title :
Two-Level Dictionary-Based Text Compression Scheme
Author :
Zia, Md Ziaul Karim ; Rahman, Dewan Md Fayzur ; Rahman, Chowdhury Mofizur
Author_Institution :
United Int. Univ., Dhaka
fYear :
2008
fDate :
24-27 Dec. 2008
Firstpage :
13
Lastpage :
18
Abstract :
This paper presents a new dictionary- and memory-based text compression technique, called the two-level dictionary-based text compression scheme. Words in a text file are transformed into codewords of length 2 or 3 using a dictionary of 73,680 frequently used English words. The most frequently used of these words receive codewords of length 2 and the rest receive codewords of length 3, for better compression. The codewords are chosen in such a way that the spaces between words in the original text file can be removed altogether, recovering a substantial amount of space. Another distinctive feature of the scheme is that it recovers the unused bit of each character's ASCII representation, saving one byte for every 8 ASCII characters. Finally, an existing back-end compression algorithm is applied to compress the file. Used with gzip and bzip2, the new strategy achieves a size reduction of about 75% (a compression ratio of 2.01 bits per input character).
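The bit-recovery step in the abstract can be illustrated with a minimal sketch. It assumes the input is pure 7-bit ASCII and that the recovered bits are packed most-significant-bit first; the function names `pack7` and `unpack7` are hypothetical, and the paper's actual bit layout is not given in this record.

```python
def pack7(data: bytes) -> bytes:
    """Pack 7-bit ASCII bytes into a dense bit stream (8 chars -> 7 bytes).

    Assumption: bits are emitted MSB-first; the last byte is zero-padded.
    """
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently held in acc
    out = bytearray()
    for b in data:
        if b > 0x7F:
            raise ValueError("input must be 7-bit ASCII")
        acc = (acc << 7) | b          # append the 7 useful bits
        nbits += 7
        while nbits >= 8:             # flush every complete byte
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1       # keep only the unflushed bits
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # zero-pad the final byte
    return bytes(out)


def unpack7(packed: bytes, n: int) -> bytes:
    """Recover n 7-bit characters from a stream produced by pack7.

    The character count n is passed explicitly because the padding in
    the last byte is otherwise ambiguous.
    """
    acc = 0
    nbits = 0
    out = bytearray()
    for b in packed:
        acc = (acc << 8) | b
        nbits += 8
        while nbits >= 7 and len(out) < n:
            nbits -= 7
            out.append((acc >> nbits) & 0x7F)
        acc &= (1 << nbits) - 1
    return bytes(out)


text = b"compress"            # 8 ASCII characters
packed = pack7(text)          # occupies 7 bytes
restored = unpack7(packed, len(text))
```

Under these assumptions, 8 characters carry 56 useful bits and fit in 7 bytes, which matches the one-byte-per-8-characters saving stated in the abstract. The reported figures are also consistent with each other: 2.01 bits per character against 8 bits per input character gives 1 - 2.01/8 ≈ 0.749, or about 75% reduction.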
Keywords :
data compression; dictionaries; natural language processing; text analysis; ASCII character representation; English language; Huffman code; codeword; memory-based text compression; text file; two-level dictionary-based text compression; word transformation; Compression algorithms; Costs; Decoding; Information technology; Mathematical model; Natural languages; Propagation losses; Runtime; Dictionary based compression; Text compression
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computer and Information Technology, 2008. ICCIT 2008. 11th International Conference on
Conference_Location :
Khulna
Print_ISBN :
978-1-4244-2135-0
Electronic_ISBN :
978-1-4244-2136-7
Type :
conf
DOI :
10.1109/ICCITECHN.2008.4803026
Filename :
4803026