Title :
A New Approach for Document Indexing Using Wavelet Trees
Author :
Brisaboa, Nieves R. ; Cillero, Yolanda ; Fariña, Antonio ; Ladra, Susana ; Pedreira, Oscar
Author_Institution :
Univ. of A Coruna, A Coruna
Abstract :
Applications that manage large text collections need indexing methods that allow efficient retrieval over text. Several indexes have been proposed that try to reach a good trade-off between the space needed to store both the text and the index and the search efficiency. Self-indexes are becoming increasingly popular: not only do they index the text, but they also keep enough information to recover any portion of it without storing it explicitly. Therefore, they actually replace the text. In this paper, we focus on a self-index known as the wavelet tree. Originally organized as a binary tree, it was designed to index the characters of a text. We present three variants of this method that aim to reduce its size while keeping a good trade-off between space and performance, and that make it well suited for indexing natural language texts. The first approach joins Huffman compression and wavelet trees. The other two variants index words instead of characters and use two different word-based compressors.
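As an illustration of the self-index idea in the abstract, the following is a minimal sketch of a character-level binary wavelet tree. It is not the authors' implementation: the class name `WaveletTree` and the methods `access` and `rank` are assumptions made for this example, the tree is split by halving the sorted alphabet (not Huffman-shaped or word-based as in the paper's variants), and bitmaps are plain Python lists scanned linearly instead of compact bit vectors with rank support.

```python
class WaveletTree:
    """Illustrative balanced binary wavelet tree over the characters of a text."""

    def __init__(self, text, alphabet=None):
        # Sorted distinct symbols handled by this node (hypothetical layout).
        self.alphabet = sorted(set(text)) if alphabet is None else alphabet
        self.left = self.right = None
        self.bits = []
        if len(self.alphabet) <= 1:
            return  # leaf: every position here holds the same symbol
        mid = len(self.alphabet) // 2
        self.left_syms = set(self.alphabet[:mid])
        # One bit per symbol: 0 if it belongs to the left half of the alphabet.
        self.bits = [0 if c in self.left_syms else 1 for c in text]
        self.left = WaveletTree(
            [c for c in text if c in self.left_syms], self.alphabet[:mid])
        self.right = WaveletTree(
            [c for c in text if c not in self.left_syms], self.alphabet[mid:])

    def _rank_bit(self, bit, i):
        # Number of occurrences of `bit` in bits[:i] (linear scan for clarity).
        return sum(1 for b in self.bits[:i] if b == bit)

    def access(self, i):
        """Recover the symbol at position i without storing the text itself."""
        if self.left is None:
            return self.alphabet[0]
        bit = self.bits[i]
        child = self.right if bit else self.left
        return child.access(self._rank_bit(bit, i))

    def rank(self, c, i):
        """Count occurrences of symbol c in the first i positions of the text."""
        if self.left is None:
            return i if self.alphabet and self.alphabet[0] == c else 0
        bit = 0 if c in self.left_syms else 1
        child = self.right if bit else self.left
        return child.rank(c, self._rank_bit(bit, i))


wt = WaveletTree("abracadabra")
recovered = "".join(wt.access(i) for i in range(11))  # rebuilt from bitmaps only
```

Because `access` rebuilds every symbol from the per-node bitmaps, the original string does not need to be kept, which is the sense in which a self-index replaces the text.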
Keywords :
Huffman codes; data compression; indexing; information retrieval; natural language processing; text analysis; trees (mathematics); wavelet transforms; Huffman compression; binary tree; document indexing; large text collection management; natural language texts indexing; self-indexes; text retrieval; wavelet trees; word-based compressors; Binary trees; Compressors; Databases; Expert systems; Indexing; Information retrieval; Laboratories; Multiple signal classification; Natural languages; Streaming media;
Conference_Title :
Database and Expert Systems Applications, 2007. DEXA '07. 18th International Workshop on
Conference_Location :
Regensburg
Print_ISBN :
978-0-7695-2932-5
DOI :
10.1109/DEXA.2007.118