• DocumentCode
    1018040
  • Title

    Asymptotic properties of data compression and suffix trees

  • Author

    Szpankowski, Wojciech

  • Author_Institution
    Dept. of Comput. Sci., Purdue Univ., West Lafayette, IN, USA
  • Volume
    39
  • Issue
    5
  • fYear
    1993
  • fDate
    9/1/1993 12:00:00 AM
  • Firstpage
    1647
  • Lastpage
    1659
  • Abstract
    Recently, Wyner and Ziv (see ibid., vol.35, p.1250-8, 1989) have proved that the typical length of a repeated subword found within the first n positions of a stationary ergodic sequence is (1/h) log n in probability where h is the entropy of the alphabet. This finding was used to obtain several insights into certain universal data compression schemes, most notably the Lempel-Ziv data compression algorithm. Wyner and Ziv have also conjectured that their result can be extended to a stronger almost sure convergence. In this paper, we settle this conjecture in the negative in the so called right domain asymptotic, that is, during a dynamic phase of expanding the data base. We prove-under an additional assumption involving mixing conditions-that the length of a typical repeated subword oscillates almost surely (a.s.) between (1/h1)log n and (1/h2)log n where D<h 2<h⩽h1<∞. We also show that the length of the nth block in the Lempel-Ziv parsing algorithm reveals a similar behavior. We relate our findings to some problems on digital trees, namely the asymptotic behavior of a (noncompact) suffix tree built from suffixes of a random sequence. We prove that the height and the shortest feasible path in a suffix tree are typically (1/h2 )log n (a.s.) and (1/h1)log n (a.s.) respectively
  • Keywords
    convergence; data compression; entropy; information theory; probability; trees (mathematics); Lempel-Ziv parsing algorithm; almost sure convergence; alphabet entropy; data compression; digital trees; mixing conditions; probability; random sequence; repeated subword; right domain asymptotic; shortest feasible path; stationary ergodic sequence; suffix trees; universal data compression; Algorithm design and analysis; Codes; Computer science; Convergence; Data compression; Entropy; Face detection; Hydrogen; Pattern matching; Random sequences;
  • fLanguage
    English
  • Journal_Title
    Information Theory, IEEE Transactions on
  • Publisher
    ieee
  • ISSN
    0018-9448
  • Type

    jour

  • DOI
    10.1109/18.259648
  • Filename
    259648