• DocumentCode
    3422594
  • Title

    A New Approach for Document Indexing Using Wavelet Trees

  • Author

    Brisaboa, Nieves R. ; Cillero, Yolanda ; Fariña, Antonio ; Ladra, Susana ; Pedreira, Oscar

  • Author_Institution
    Univ. of A Coruna, A Coruna
  • fYear
    2007
  • fDate
    3-7 Sept. 2007
  • Firstpage
    69
  • Lastpage
    73
  • Abstract
    The development of applications that manage large text collections requires indexing methods that allow efficient retrieval over text. Several indexes have been proposed that try to reach a good trade-off between search efficiency and the space needed to store both the text and the index. Self-indexes are becoming increasingly popular. Not only do they index the text, but they also keep enough information to recover any portion of it without storing it explicitly. Therefore, they actually replace the text. In this paper, we focus on a self-index known as the wavelet tree. Originally organized as a binary tree, it was designed to index the characters of a text. We present three variants of this method that aim at reducing its size while keeping a good trade-off between space and performance, and at making it well suited for indexing natural language texts. The first approach we describe combines Huffman compression and wavelet trees. The other two new variants index words instead of characters and use two different word-based compressors.
  • Keywords
    Huffman codes; data compression; indexing; information retrieval; natural language processing; text analysis; trees (mathematics); wavelet transforms; Huffman compression; binary tree; document indexing; large text collection management; natural language texts indexing; self-indexes; text retrieval; wavelet trees; word-based compressors; Binary trees; Compressors; Databases; Expert systems; Indexing; Information retrieval; Laboratories; Multiple signal classification; Natural languages; Streaming media
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Database and Expert Systems Applications, 2007. DEXA '07. 18th International Workshop on
  • Conference_Location
    Regensburg
  • ISSN
    1529-4188
  • Print_ISBN
    978-0-7695-2932-5
  • Type

    conf

  • DOI
    10.1109/DEXA.2007.118
  • Filename
    4312859