Title :
ADtrees for Sequential Data and N-gram Counting
Author :
Van Dam, Rob ; Ventura, Dan
Author_Institution :
Brigham Young Univ., Provo
Abstract :
We consider the problem of efficiently storing n-gram counts for large n over very large corpora. In such cases, the efficient storage of sufficient statistics can have a dramatic impact on system performance. One popular model for storing such statistics, when they are derived from tabular data sets with many attributes, is the ADtree. Here, we adapt the ADtree to benefit from the sequential structure of corpus-type data. We demonstrate the usefulness of our approach on a portion of the well-known Wall Street Journal corpus from the Penn Treebank and show that it is exponentially more efficient than the naive approach to storing n-grams and also significantly more efficient than a traditional prefix tree.
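For orientation, the traditional prefix tree that the abstract uses as a baseline can be sketched as follows. This is a minimal illustration only, not the authors' sequential ADtree: the names `TrieNode`, `build_ngram_trie`, `ngram_count`, and `node_count` are hypothetical, and the toy sentence stands in for a real corpus.

```python
from collections import defaultdict


class TrieNode:
    """One node per distinct observed prefix; `count` is how often it occurs."""

    def __init__(self):
        self.count = 0
        self.children = defaultdict(TrieNode)


def build_ngram_trie(tokens, n):
    """Insert every window of up to n consecutive tokens into a prefix tree."""
    root = TrieNode()
    for start in range(len(tokens)):
        node = root
        # Windows near the end of the sequence are shorter than n.
        for token in tokens[start:start + n]:
            node = node.children[token]
            node.count += 1
    return root


def ngram_count(root, ngram):
    """Return the stored count of `ngram` (length <= n), or 0 if unseen."""
    node = root
    for token in ngram:
        if token not in node.children:  # membership test does not create nodes
            return 0
        node = node.children[token]
    return node.count


def node_count(node):
    """Total number of nodes in the tree, including the root."""
    return 1 + sum(node_count(child) for child in node.children.values())


# Toy corpus (an assumption for illustration, not the WSJ data).
tokens = "the cat sat on the mat the cat ran".split()
root = build_ngram_trie(tokens, n=3)
vocabulary_size = len(set(tokens))
naive_cells = vocabulary_size ** 3  # one cell per possible trigram
```

On this toy input, `ngram_count(root, ("the", "cat"))` looks up a bigram count by walking two edges, and `node_count(root)` can be compared with `naive_cells` to see that the tree allocates nodes only for n-grams that actually occur, whereas a naive table reserves a cell for every possible n-gram.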
Keywords :
hidden Markov models; natural language processing; tree data structures; ADtrees; n-gram count storing; sequential data; very large natural language corpora; Logic programming; Natural languages; Programmable logic arrays; Smoothing methods; Speech processing; Statistics; System performance; Tagging;
Conference_Title :
Systems, Man and Cybernetics, 2007. ISIC. IEEE International Conference on
Conference_Location :
Montreal, Que.
Print_ISBN :
978-1-4244-0990-7
Electronic_ISBN :
978-1-4244-0991-4
DOI :
10.1109/ICSMC.2007.4413704