DocumentCode
2541189
Title
ADtrees for sequential data and n-gram Counting
Author
Dam, Rob Van ; Ventura, Dan
Author_Institution
Brigham Young Univ., Provo
fYear
2007
fDate
7-10 Oct. 2007
Firstpage
492
Lastpage
497
Abstract
We consider the problem of efficiently storing n-gram counts for large n over very large corpora. In such cases, the efficient storage of sufficient statistics can have a dramatic impact on system performance. One popular model for storing such data derived from tabular data sets with many attributes is the ADtree. Here, we adapt the ADtree to benefit from the sequential structure of corpora-type data. We demonstrate the usefulness of our approach on a portion of the well-known Wall Street Journal corpus from the Penn Treebank and show that our approach is exponentially more efficient than the naive approach to storing n-grams and is also significantly more efficient than a traditional prefix tree.
Keywords
hidden Markov models; natural language processing; tree data structures; ADtrees; n-gram count storing; sequential data; very large natural language corpora; Logic programming; Natural languages; Programmable logic arrays; Smoothing methods; Speech processing; Statistics; System performance; Tagging
fLanguage
English
Publisher
IEEE
Conference_Titel
Systems, Man and Cybernetics, 2007. ISIC. IEEE International Conference on
Conference_Location
Montreal, Que.
Print_ISBN
978-1-4244-0990-7
Electronic_ISBN
978-1-4244-0991-4
Type
conf
DOI
10.1109/ICSMC.2007.4413704
Filename
4413704