Author_Institution :
Dept. of Comput. Sci., Rutgers Univ., Piscataway, NJ, USA
Abstract :
Hierarchical dictionary-based compression schemes form a grammar for a text by replacing each repeated string with a production rule. While such schemes usually operate on-line, making a replacement as soon as repetition is detected, off-line operation permits greater freedom in choosing the order of replacement. In this paper, we compare the on-line method with three off-line heuristics for selecting the next substring to replace: longest string first, most common string first, and the string that minimizes the size of the grammar locally. Surprisingly, two of the off-line techniques, like the on-line method, run in time linear in the size of the input. We evaluate each technique on artificial and natural sequences. In general, the locally-most-compressive heuristic performs best, followed by most frequent, the on-line technique, and, lagging by some distance, the longest-first technique.
Keywords :
data compression; grammars; knowledge based systems; sequences; artificial sequences; dictionary-based compression schemes; grammar; hierarchies of repetitions; locally-most-compressive heuristic; natural sequences; off-line heuristics; on-line heuristics; production rule; Compression algorithms; Computer science; Dictionaries; Frequency; Hardware; Probability distribution;