DocumentCode :
3422594
Title :
A New Approach for Document Indexing Using Wavelet Trees
Author :
Brisaboa, Nieves R. ; Cillero, Yolanda ; Fariña, Antonio ; Ladra, Susana ; Pedreira, Oscar
Author_Institution :
Univ. of A Coruna, A Coruna
fYear :
2007
fDate :
3-7 Sept. 2007
Firstpage :
69
Lastpage :
73
Abstract :
The development of applications that manage large text collections requires indexing methods that allow efficient retrieval over text. Several indexes have been proposed that try to reach a good trade-off between the space needed to store both the text and the index, and search efficiency. Self-indexes are becoming increasingly popular. Not only do they index the text, but they also keep enough information to recover any portion of it without storing it explicitly. Therefore, they actually replace the text. In this paper, we focus on a self-index known as the wavelet tree. Originally organized as a binary tree, it was designed to index the characters of a text. We present three variants of this method that aim at reducing its size while keeping a good trade-off between space and performance, as well as making it well suited for indexing natural language texts. The first approach we describe combines Huffman compression and wavelet trees. The other two new variants index words instead of characters and use two different word-based compressors.
Keywords :
Huffman codes; data compression; indexing; information retrieval; natural language processing; text analysis; trees (mathematics); wavelet transforms; Huffman compression; binary tree; document indexing; large text collection management; natural language texts indexing; self-indexes; text retrieval; wavelet trees; word-based compressors; Binary trees; Compressors; Databases; Expert systems; Indexing; Information retrieval; Laboratories; Multiple signal classification; Natural languages; Streaming media;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Database and Expert Systems Applications, 2007. DEXA '07. 18th International Workshop on
Conference_Location :
Regensburg
ISSN :
1529-4188
Print_ISBN :
978-0-7695-2932-5
Type :
conf
DOI :
10.1109/DEXA.2007.118
Filename :
4312859