DocumentCode :
1421326
Title :
On-line and off-line heuristics for inferring hierarchies of repetitions in sequences
Author :
Nevill-manning, Craig G. ; Witten, Ian H.
Author_Institution :
Dept. of Comput. Sci., Rutgers Univ., Piscataway, NJ, USA
Volume :
88
Issue :
11
fYear :
2000
Firstpage :
1745
Lastpage :
1755
Abstract :
Hierarchical dictionary-based compression schemes form a grammar for a text by replacing each repeated string with a production rule. While such schemes usually operate on-line, making a replacement as soon as repetition is detected, off-line operation permits greater freedom in choosing the order of replacement. In this paper, we compare the on-line method with three off-line heuristics for selecting the next substring to replace: longest string first, most common string first, and the string that minimizes the size of the grammar locally. Surprisingly, two of the off-line techniques, like the on-line method, run in time linear in the size of the input. We evaluate each technique on artificial and natural sequences. In general, the locally-most-compressive heuristic performs best, followed by most frequent, the on-line technique, and, lagging by some distance, the longest-first technique.
Keywords :
data compression; grammars; knowledge based systems; sequences; artificial sequences; dictionary-based compression schemes; grammar; hierarchies of repetitions; locally-most-compressive heuristic; natural sequences; off-line heuristics; on-line heuristics; production rule; Compression algorithms; Computer science; Dictionaries; Frequency; Hardware; Probability distribution;
fLanguage :
English
Journal_Title :
Proceedings of the IEEE
Publisher :
ieee
ISSN :
0018-9219
Type :
jour
DOI :
10.1109/5.892710
Filename :
892710
Link To Document :
بازگشت