DocumentCode
3450720
Title
Efficient Updates for Web-Scale Indexes over the Cloud
Author
Antonopoulos, Panagiotis ; Konstantinou, Ioannis ; Tsoumakos, Dimitrios ; Koziris, Nectarios
Author_Institution
Microsoft Corp, Redmond, WA, USA
fYear
2012
fDate
1-5 April 2012
Firstpage
135
Lastpage
142
Abstract
In this paper, we present a distributed system that enables fast and frequent updates on web-scale inverted indexes. The proposed update technique processes new or modified data incrementally and minimizes the changes required to the index, significantly reducing update time, which becomes independent of the existing index size. By using Hadoop MapReduce to parallelize the update operations and HBase to distribute the inverted index, we create a high-performance, fully distributed index creation and update system. To the best of our knowledge, this is the first open source system that creates, updates and serves large-scale indexes in a distributed fashion. Experiments with over 23 million Wikipedia documents demonstrate the speed and robustness of our implementation: update time scales linearly with the size of the updates and the degree of change in the documents, and remains constant regardless of the size of the underlying index. Moreover, performance improves significantly as more computational resources are added: the system incorporates a 15.4 GB update batch into a 64.2 GB indexed dataset in about 21 minutes using just 12 commodity nodes, 3.3 times faster than with two nodes.
Keywords
Web sites; cloud computing; distributed processing; document handling; Hadoop MapReduce; Web scale inverted indexes; Wikipedia documents; distributed fashion; distributed system; inverted index; open source system; Educational institutions; Electronic publishing; Encyclopedias; Indexing; Internet
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Engineering Workshops (ICDEW), 2012 IEEE 28th International Conference on
Conference_Location
Arlington, VA
Print_ISBN
978-1-4673-1640-8
Type
conf
DOI
10.1109/ICDEW.2012.51
Filename
6313670