Title :
Computing preset dictionaries from text corpora for the compression of messages
Author :
Abel, Marc W. ; Chung, Soon M.
Author_Institution :
Dept. of Comput. Sci. & Eng., Wright State Univ., Dayton, OH, USA
Abstract :
Rigid length limits of short messages greatly restrict users' ability to express ideas intelligibly. While data compression can help by enabling greater expressivity in short messages, most work in compression has focused on managing large streams of data instead of small ones. We investigated the potential for preset dictionaries to unleash zlib's ability to compress short messages typical of Short Message Service (SMS) texts, microblog updates, and other single-packet transactions. This paper proposes two preset dictionary generation methods and reports strong test results across two dissimilar text corpora: the Enron database of email messages, and the IEEE VAST Challenge 2011 microblog corpus. For exchanges in English using our proposed methods, it is possible to extend "tweets" from 140 to 197 septets on average, and to extend SMS texts from 160 to 227 septets on average. The preset dictionary's role is as important as zlib's, and each requires the other to obtain these gains.
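As a rough illustration of the preset-dictionary mechanism the abstract refers to, the sketch below uses the `zdict` parameter of Python's standard `zlib` module. It is not the paper's dictionary generation method: the dictionary contents, the sample message, and the helper names `compress_short` and `decompress_short` are all hypothetical, chosen only to show that sender and receiver must share the same dictionary and that a short message built from dictionary phrases compresses better with it than without it.

```python
import zlib

# Hypothetical preset dictionary (an assumption for illustration only).
# A real one would be computed from a corpus; zlib's documentation advises
# placing the most common strings toward the end of the dictionary.
PRESET_DICT = (
    b"please call me when you get this "
    b"thanks for the update "
    b"see you at the meeting tomorrow"
)


def compress_short(message: bytes, zdict: bytes = PRESET_DICT) -> bytes:
    """Compress a short message using a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(message) + comp.flush()


def decompress_short(blob: bytes, zdict: bytes = PRESET_DICT) -> bytes:
    """Decompress a message; the receiver must hold the same dictionary."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


if __name__ == "__main__":
    msg = b"see you at the meeting tomorrow, thanks for the update"
    with_dict = compress_short(msg)
    without_dict = zlib.compress(msg, 9)
    print(len(msg), len(without_dict), len(with_dict))
    assert decompress_short(with_dict) == msg
```

Note that the standard zlib container adds a header, a dictionary identifier, and a checksum, which is a noticeable overhead on messages this small; a deployment targeting SMS-sized payloads might prefer a raw deflate stream, which is left out here to keep the sketch minimal.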
Keywords :
data compression; dictionaries; electronic mail; electronic messaging; mobile computing; Enron database of email messages; IEEE VAST Challenge 2011 microblog corpus; SMS texts; microblog updates; preset dictionaries; preset dictionary generation methods; short message compression; short message service texts; single-packet transactions; text corpora; zlib; Message compression; SMS messages; performance analysis; preset dictionary
Conference_Title :
Data and Software Engineering (ICODSE), 2014 International Conference on
Print_ISBN :
978-1-4799-8175-5
DOI :
10.1109/ICODSE.2014.7062490