• DocumentCode
    2541189
  • Title

    ADtrees for Sequential Data and N-gram Counting

  • Author

    Van Dam, Rob; Ventura, Dan

  • Author_Institution
    Brigham Young Univ., Provo
  • fYear
    2007
  • fDate
    7-10 Oct. 2007
  • Firstpage
    492
  • Lastpage
    497
  • Abstract
    We consider the problem of efficiently storing n-gram counts for large n over very large corpora. In such cases, the efficient storage of sufficient statistics can have a dramatic impact on system performance. One popular model for storing such data derived from tabular data sets with many attributes is the ADtree. Here, we adapt the ADtree to benefit from the sequential structure of corpora-type data. We demonstrate the usefulness of our approach on a portion of the well-known Wall Street Journal corpus from the Penn Treebank and show that our approach is exponentially more efficient than the naive approach to storing n-grams and is also significantly more efficient than a traditional prefix tree.
  • Keywords
    hidden Markov models; natural language processing; tree data structures; ADtrees; n-gram count storing; sequential data; very large natural language corpora; Logic programming; Natural languages; Programmable logic arrays; Smoothing methods; Speech processing; Statistics; System performance; Tagging
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Systems, Man and Cybernetics, 2007. ISIC. IEEE International Conference on
  • Conference_Location
    Montreal, Que.
  • Print_ISBN
    978-1-4244-0990-7
  • Electronic_ISBN
    978-1-4244-0991-4
  • Type

    conf

  • DOI
    10.1109/ICSMC.2007.4413704
  • Filename
    4413704