DocumentCode :
3048980
Title :
Compressing XML with multiplexed hierarchical PPM models
Author :
Cheney, James
Author_Institution :
Cornell Univ., Ithaca, NY, USA
fYear :
2001
fDate :
2001
Firstpage :
163
Lastpage :
172
Abstract :
We established a working Extensible Markup Language (XML) compression benchmark based on text compression, and found that bzip2 compresses XML best, albeit more slowly than gzip. Our experiments verified that XMILL speeds up and improves compression using gzip and bounded-context PPM by up to 15%, but found that it worsens the compression for bzip2 and PPM. We describe alternative approaches to XML compression that illustrate other tradeoffs between speed and effectiveness. We describe experiments using several text compressors and XMILL to compress a variety of XML documents. Using these as a benchmark, we describe our two main results: an online binary encoding for XML called Encoded SAX (ESAX) that compresses better and faster than existing methods; and an online, adaptive, XML-conscious encoding based on prediction by partial match (PPM) called multiplexed hierarchical modeling (MHM) that compresses up to 35% better than any existing method but is fairly slow.
Keywords :
adaptive codes; data compression; document image processing; hypermedia markup languages; multiplexing; prediction theory; PPM; XMILL; XML compression; XML-conscious encoding; adaptive encoding; bounded-context PPM; bzip2; encoded SAX; extensible markup language; gzip; multiplexed hierarchical PPM models; multiplexed hierarchical modeling; online binary encoding; online encoding; prediction by partial match; text compression; text compressors; Computer industry; Encoding; Entropy; HTML; Markup languages; SGML; Software systems; Testing; Tree data structures; XML;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Compression Conference, 2001. Proceedings. DCC 2001.
Conference_Location :
Snowbird, UT
ISSN :
1068-0314
Print_ISBN :
0-7695-1031-0
Type :
conf
DOI :
10.1109/DCC.2001.917147
Filename :
917147