Title :
Phrase hierarchy inference and compression in bounded space
Author :
Manning, Graig G Nelvill ; Witten, Ian H.
Author_Institution :
Dept. of Biochem., Stanford Univ., CA, USA
fDate :
30 Mar-1 Apr 1998
Abstract :
Text compression by inferring a phrase hierarchy from the input is a technique that shows promise as a compression scheme and as a machine learning method that extracts some comprehensible account of the structure of the input text. Its performance as a data compression scheme outstrips other dictionary schemes, and the structures that it learns from sequences have been put to such eclectic uses as phrase browsing in digital libraries, music analysis, and inferring rules for fractal images. We focus attention on the memory requirements of the method. Since the algorithm operates in linear time, the space it consumes is at most linear with input size. The space consumed does in fact grow linearly with the size of the inferred hierarchy, and this makes operation on very large files infeasible. We describe two elegant ways of curtailing the space complexity of hierarchy inference, one of which yields a bounded space algorithm. We begin with a review of the hierarchy inference procedure that is embodied in the SEQUITUR program. Then we consider its performance on quite large files, and show how the compression performance improves as the file size increases
Keywords :
data compression; grammars; inference mechanisms; word processing; SEQUITUR program; bounded space algorithm; compression performance; data compression; dictionary schemes; digital libraries; file size; fractal images; input text structure; large files; linear time algorithm; machine learning method; memory requirements; music analysis; phrase browsing; phrase hierarchy inference; rules; space complexity; text compression; Data compression; Data mining; Dictionaries; Fractals; Image analysis; Image sequence analysis; Inference algorithms; Learning systems; Performance analysis; Software libraries;
Conference_Titel :
Data Compression Conference, 1998. DCC '98. Proceedings
Conference_Location :
Snowbird, UT
Print_ISBN :
0-8186-8406-2
DOI :
10.1109/DCC.1998.672146