DocumentCode
147022
Title
Fully Online Grammar Compression in Constant Space
Author
Maruyama, Shoichi ; Tabei, Yasuo
Author_Institution
Preferred Infrastruct., Inc., Japan
fYear
2014
fDate
26-28 March 2014
Firstpage
173
Lastpage
182
Abstract
We present novel variants of fully online LCA (FOLCA), a fully online grammar compression that builds a straight line program (SLP) and directly encodes it into a succinct representation in an online manner. FOLCA enables a direct encoding of an SLP into a succinct representation that is asymptotically equivalent to an information theoretic lower bound for representing an SLP (Maruyama et al., SPIRE´13). The compression of FOLCA takes linear time proportional to the length of an input text and its working space depends only on the size of the SLP, which enables us to apply FOLCA to large-scale repetitive texts. Recent repetitive texts, however, include some noise. For example, current sequencing technology has significant error rates, which embeds noise into genome sequences. For such noisy repetitive texts, FOLCA working in the SLP size consumes a large amount of memory. We present two variants of FOLCA working in constant space by leveraging the idea behind stream mining techniques. Experiments using 100 human genomes corresponding to about 300GB from the 1000 human genomes project revealed the applicability of our method to large-scale, noisy repetitive texts.
Keywords
data compression; grammars; FOLCA; SLP; constant space; fully online LCA; fully online grammar compression; human genomes; information theoretic lower bound; online manner; straight line program; stream mining techniques; Bioinformatics; Dictionaries; Genomics; Grammar; Noise; Production; Vegetation; Grammar compression; Large-scale text; Scalable algorithm;
fLanguage
English
Publisher
ieee
Conference_Titel
Data Compression Conference (DCC), 2014
Conference_Location
Snowbird, UT
ISSN
1068-0314
Type
conf
DOI
10.1109/DCC.2014.69
Filename
6824425
Link To Document