DocumentCode :
2541189
Title :
ADtrees for sequential data and n-gram Counting
Author :
Dam, Rob Van ; Ventura, Dan
Author_Institution :
Brigham Young Univ., Provo
fYear :
2007
fDate :
7-10 Oct. 2007
Firstpage :
492
Lastpage :
497
Abstract :
We consider the problem of efficiently storing n-gram counts for large n over very large corpora. In such cases, the efficient storage of sufficient statistics can have a dramatic impact on system performance. One popular model for storing such data derived from tabular data sets with many attributes is the ADtree. Here, we adapt the ADtree to benefit from the sequential structure of corpora-type data. We demonstrate the usefulness of our approach on a portion of the well-known Wall Street Journal corpus from the Penn Treebank and show that our approach is exponentially more efficient than the naive approach to storing n-grams and is also significantly more efficient than a traditional prefix tree.
Keywords :
hidden Markov models; natural language processing; tree data structures; ADtrees; n-gram count storing; sequential data; very large natural language corpora; Logic programming; Natural languages; Programmable logic arrays; Smoothing methods; Speech processing; Statistics; System performance; Tagging
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Systems, Man and Cybernetics, 2007. ISIC. IEEE International Conference on
Conference_Location :
Montreal, Que.
Print_ISBN :
978-1-4244-0990-7
Electronic_ISBN :
978-1-4244-0991-4
Type :
conf
DOI :
10.1109/ICSMC.2007.4413704
Filename :
4413704