• DocumentCode
    3450720
  • Title
    Efficient Updates for Web-Scale Indexes over the Cloud
  • Author
    Antonopoulos, Panagiotis ; Konstantinou, Ioannis ; Tsoumakos, Dimitrios ; Koziris, Nectarios
  • Author_Institution
    Microsoft Corp, Redmond, WA, USA
  • fYear
    2012
  • fDate
    1-5 April 2012
  • Firstpage
    135
  • Lastpage
    142
  • Abstract
    In this paper, we present a distributed system that enables fast and frequent updates on web-scale inverted indexes. The proposed update technique processes new or modified data incrementally and minimizes the changes required to the index, significantly reducing the update time, which becomes independent of the existing index size. By using Hadoop MapReduce to parallelize the update operations and HBase to distribute the inverted index, we create a high-performance, fully distributed index creation and update system. To the best of our knowledge, this is the first open source system that creates, updates and serves large-scale indexes in a distributed fashion. Experiments with over 23 million Wikipedia documents demonstrate the speed and robustness of our implementation: it scales linearly with the size of the updates and the degree of change in the documents, and its update time stays constant regardless of the size of the underlying index. Moreover, performance improves significantly as more computational resources are added: the system incorporates a 15.4 GB update batch into a 64.2 GB indexed dataset in about 21 minutes using just 12 commodity nodes, 3.3 times faster than with two nodes.
  • Keywords
    Web sites; cloud computing; distributed processing; document handling; Hadoop MapReduce; Web scale inverted indexes; Wikipedia documents; distributed fashion; distributed system; inverted index; open source system; Educational institutions; Electronic publishing; Encyclopedias; Indexing; Internet
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering Workshops (ICDEW), 2012 IEEE 28th International Conference on
  • Conference_Location
    Arlington, VA
  • Print_ISBN
    978-1-4673-1640-8
  • Type
    conf
  • DOI
    10.1109/ICDEW.2012.51
  • Filename
    6313670
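
The incremental update idea described in the abstract (touch only the postings that actually change, so the cost depends on the size of the change rather than on the size of the index) can be illustrated with a minimal sketch. This is not the authors' implementation: it is a single-process, in-memory illustration that uses neither Hadoop MapReduce nor HBase, and the names `InvertedIndex`, `apply_update`, `search`, and `tokenize` are hypothetical, introduced here only for the example.

```python
from collections import defaultdict


def tokenize(text):
    """Hypothetical tokenizer: lowercase, whitespace-split, unique terms."""
    return set(text.lower().split())


class InvertedIndex:
    """Toy in-memory inverted index (assumption: not the paper's design)."""

    def __init__(self):
        # term -> set of document ids that contain the term
        self.postings = defaultdict(set)
        # document id -> set of terms currently indexed for that document
        self.doc_terms = {}

    def apply_update(self, doc_id, new_text):
        """Index a new or modified document by applying only the term diff.

        Returns the number of posting entries that were changed, which
        depends on the size of the change, not on the size of the index.
        """
        old_terms = self.doc_terms.get(doc_id, set())
        new_terms = tokenize(new_text)

        removed = old_terms - new_terms  # terms no longer in the document
        added = new_terms - old_terms    # terms that newly appear

        for term in removed:
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]  # drop empty posting lists
        for term in added:
            self.postings[term].add(doc_id)

        self.doc_terms[doc_id] = new_terms
        return len(removed) + len(added)

    def search(self, term):
        """Return the sorted ids of documents containing the term."""
        return sorted(self.postings.get(term.lower(), set()))
```

In this sketch, re-indexing a document whose text changes from "hadoop mapreduce index" to "hadoop hbase index" modifies two posting entries (one removal, one addition) and leaves every other posting list untouched. A distributed system would presumably partition this diff computation across workers and store the postings in a distributed table, but those details are not given in the record above.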