• DocumentCode
    2308269
  • Title

    Document-Oriented Pruning of the Inverted Index in Information Retrieval Systems

  • Author

    Zheng, Lei ; Cox, Ingemar J.

  • Author_Institution
    Univ. Coll. London, London
  • fYear
    2009
  • fDate
    26-29 May 2009
  • Firstpage
    697
  • Lastpage
    702
  • Abstract
    Searching very large collections can be costly in both computation and storage. To reduce this cost, recent research has focused on reducing the size (pruning) of the inverted index. The inverted index represents a table, the rows and columns of which are terms in the lexicon and documents in the collection, respectively. A non-zero entry in the table, known as a posting, indicates that the corresponding document contains the term. Previous research on static index pruning was either (i) posting-oriented, in which less important postings are removed from the table, or (ii) term-oriented, in which less important terms are removed from the table. In this paper, we investigate a new, document-oriented pruning strategy that removes entire columns of the table, i.e. removes less important documents from the collection. Three methods for estimating the importance of a document are proposed. Methods 1 and 2 are dependent on the score function of the retrieval system (e.g. Okapi BM25), while Method 3 is independent of the retrieval system. Experimental results compare the three proposed methods with Carmel et al.'s posting-oriented approach, using both the FT and LA Times collections and using both ordinary and difficult queries. Based on mean average precision and precision at 10, experimental results show that Method 3 generally performs best on the FT collection for pruned indexes down to 35% of the original size. However, for more severe pruning, Carmel et al.'s algorithm is better. For the LA Times collection, the performance of Method 3 and that of Carmel et al. are reversed. This variation in performance across collections has not been previously reported.
  • Keywords
    information retrieval; very large databases; document-oriented pruning; information retrieval systems; inverted index; nonzero entry; posting-oriented approach; very large collections; Advertising; Computer networks; Costs; Data structures; Educational institutions; Indexing; Information retrieval; Search engines; Vocabulary; Web search;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Advanced Information Networking and Applications Workshops, 2009. WAINA '09. International Conference on
  • Conference_Location
    Bradford
  • Print_ISBN
    978-1-4244-3999-7
  • Electronic_ISBN
    978-0-7695-3639-2
  • Type

    conf

  • DOI
    10.1109/WAINA.2009.147
  • Filename
    5136730
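
The three pruning orientations named in the abstract (posting-, term-, and document-oriented) can be illustrated on a toy index. This is a minimal sketch only: the index contents, the function names, the threshold, and in particular the document-importance measure (a plain sum of posting scores) are assumptions made for illustration. It does not reproduce the paper's Methods 1 to 3 or Carmel et al.'s algorithm.

```python
from collections import defaultdict

# Toy inverted index: term -> {doc_id: posting score}.
# The scores are made-up numbers standing in for a retrieval score such as BM25.
index = {
    "pruning": {"d1": 2.0, "d2": 0.4},
    "index": {"d1": 1.5, "d2": 0.3, "d3": 1.1},
    "the": {"d1": 0.1, "d2": 0.1, "d3": 0.1},
}


def drop_empty_rows(table):
    """Remove terms whose posting lists became empty after pruning."""
    return {term: docs for term, docs in table.items() if docs}


def prune_postings(table, threshold):
    """Posting-oriented: remove individual entries scoring below a threshold."""
    pruned = {
        term: {doc: s for doc, s in docs.items() if s >= threshold}
        for term, docs in table.items()
    }
    return drop_empty_rows(pruned)


def prune_terms(table, keep_terms):
    """Term-oriented: remove whole rows (terms) not in keep_terms."""
    return {term: dict(docs) for term, docs in table.items() if term in keep_terms}


def prune_documents(table, keep_fraction):
    """Document-oriented: remove whole columns (documents).

    Importance here is a hypothetical stand-in: the sum of a document's
    posting scores. The paper proposes its own estimators instead.
    """
    importance = defaultdict(float)
    for docs in table.values():
        for doc, s in docs.items():
            importance[doc] += s
    ranked = sorted(importance, key=importance.get, reverse=True)
    n_keep = max(1, round(keep_fraction * len(ranked)))
    kept = set(ranked[:n_keep])
    pruned = {
        term: {doc: s for doc, s in docs.items() if doc in kept}
        for term, docs in table.items()
    }
    return drop_empty_rows(pruned)


if __name__ == "__main__":
    print(prune_postings(index, threshold=1.0))
    print(prune_terms(index, keep_terms={"pruning", "index"}))
    print(prune_documents(index, keep_fraction=0.67))
```

Note that document-oriented pruning shrinks every posting list that mentions a removed document, whereas posting-oriented pruning decides entry by entry and term-oriented pruning deletes complete posting lists.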