• DocumentCode
    147022
  • Title

    Fully Online Grammar Compression in Constant Space

  • Author

    Maruyama, Shoichi ; Tabei, Yasuo

  • Author_Institution
    Preferred Infrastruct., Inc., Japan
  • fYear
    2014
  • fDate
    26-28 March 2014
  • Firstpage
    173
  • Lastpage
    182
  • Abstract
    We present novel variants of fully online LCA (FOLCA), a fully online grammar compression that builds a straight line program (SLP) and directly encodes it into a succinct representation in an online manner. FOLCA enables a direct encoding of an SLP into a succinct representation that is asymptotically equivalent to an information theoretic lower bound for representing an SLP (Maruyama et al., SPIRE´13). The compression of FOLCA takes linear time proportional to the length of an input text and its working space depends only on the size of the SLP, which enables us to apply FOLCA to large-scale repetitive texts. Recent repetitive texts, however, include some noise. For example, current sequencing technology has significant error rates, which embeds noise into genome sequences. For such noisy repetitive texts, FOLCA working in the SLP size consumes a large amount of memory. We present two variants of FOLCA working in constant space by leveraging the idea behind stream mining techniques. Experiments using 100 human genomes corresponding to about 300GB from the 1000 human genomes project revealed the applicability of our method to large-scale, noisy repetitive texts.
  • Keywords
    data compression; grammars; FOLCA; SLP; constant space; fully online LCA; fully online grammar compression; human genomes; information theoretic lower bound; online manner; straight line program; stream mining techniques; Bioinformatics; Dictionaries; Genomics; Grammar; Noise; Production; Vegetation; Grammar compression; Large-scale text; Scalable algorithm;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Data Compression Conference (DCC), 2014
  • Conference_Location
    Snowbird, UT
  • ISSN
    1068-0314
  • Type

    conf

  • DOI
    10.1109/DCC.2014.69
  • Filename
    6824425