• DocumentCode
    3265310
  • Title

    Compressing semi-structured text using hierarchical phrase identifications

  • Author

    Nevill-Manning, Craig G. ; Witten, Ian H. ; Olsen, Dan R., Jr.

  • Author_Institution
    Dept. of Comput. Sci., Waikato Univ., Hamilton, New Zealand
  • fYear
    1996
  • fDate
    Mar/Apr 1996
  • Firstpage
    63
  • Lastpage
    72
  • Abstract
    This paper takes a compression scheme that infers a hierarchical grammar from its input, and investigates its application to semi-structured text. Although there is a huge range and variety of data that comes within the ambit of “semi-structured”, we focus attention on a particular, and very large, example of such text. Consequently the work is a case study of the application of grammar-based compression to a large-scale problem. We begin by identifying some characteristics of semi-structured text that have special relevance to data compression. We then give a brief account of a particular large textual database, and describe a compression scheme that exploits its structure. In addition to providing compression, the system gives some insight into the structure of the database. Finally we show how the hierarchical grammar can be generalized, first manually and then automatically, to yield further improvements in compression performance.
  • Keywords
    data compression; data structures; database management systems; grammars; large-scale systems; word processing; compression performance; database structure; grammar based compression; hierarchical grammar; hierarchical phrase identifications; large textual database; large-scale problem; semistructured text compression; Compression algorithms; Computer science; Costs; Databases; Globalization; Humans; Indexing; Information retrieval; SGML; Skeleton
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Compression Conference, 1996. DCC '96. Proceedings
  • Conference_Location
    Snowbird, UT
  • ISSN
    1068-0314
  • Print_ISBN
    0-8186-7358-3
  • Type

    conf

  • DOI
    10.1109/DCC.1996.488311
  • Filename
    488311