ADtrees for Sequential Data and n-gram Counting

Author

Dam, Rob Van; Ventura, Dan

Author_Institution

Brigham Young Univ., Provo

fYear

2007

fDate

7-10 Oct. 2007

Firstpage

492

Lastpage

497

Abstract

We consider the problem of efficiently storing n-gram counts for large n over very large corpora. In such cases, the efficient storage of sufficient statistics can have a dramatic impact on system performance. One popular model for storing such data derived from tabular data sets with many attributes is the ADtree. Here, we adapt the ADtree to benefit from the sequential structure of corpora-type data. We demonstrate the usefulness of our approach on a portion of the well-known Wall Street Journal corpus from the Penn Treebank and show that our approach is exponentially more efficient than the naive approach to storing n-grams and is also significantly more efficient than a traditional prefix tree.
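
For orientation, the "traditional prefix tree" baseline mentioned in the abstract can be sketched roughly as follows. This is only an illustrative sketch of a plain prefix tree for n-gram counts, not the authors' ADtree adaptation, which the abstract does not describe in detail. The names `TrieNode`, `build_ngram_trie`, and `ngram_count`, and the toy token sequence, are assumptions made for this example.

```python
class TrieNode:
    """One node per distinct n-gram prefix; `count` is its number of occurrences."""

    def __init__(self):
        self.count = 0
        self.children = {}


def build_ngram_trie(tokens, max_n):
    """Insert every n-gram of length 1..max_n from `tokens` into a prefix tree."""
    root = TrieNode()
    for start in range(len(tokens)):
        node = root
        # Walk down one level per token, so shared prefixes share nodes.
        for token in tokens[start:start + max_n]:
            node = node.children.setdefault(token, TrieNode())
            node.count += 1
    return root


def ngram_count(root, ngram):
    """Return the stored count for `ngram`, or 0 if it never occurred."""
    node = root
    for token in ngram:
        node = node.children.get(token)
        if node is None:
            return 0
    return node.count


# Hypothetical toy corpus, used only to exercise the sketch.
tokens = "the cat sat on the cat mat".split()
trie = build_ngram_trie(tokens, max_n=3)
print(ngram_count(trie, ("the", "cat")))
print(ngram_count(trie, ("the", "dog")))
```

Because n-grams that share a prefix share nodes, this layout avoids storing each n-gram as a separate key, which is the saving over the naive approach that the abstract refers to.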

Keywords

hidden Markov models; natural language processing; tree data structures; ADtrees; n-gram count storing; sequential data; very large natural language corpora; Logic programming; Natural languages; Programmable logic arrays; Smoothing methods; Speech processing; Statistics; System performance; Tagging

fLanguage

English

Publisher

IEEE

Conference_Titel

Systems, Man and Cybernetics, 2007. ISIC. IEEE International Conference on

Conference_Location

Montreal, Que.

Print_ISBN

978-1-4244-0990-7

Electronic_ISBN

978-1-4244-0991-4

Type

conf

DOI

10.1109/ICSMC.2007.4413704

Filename

4413704