Title : 
Using Index in the MapReduce Framework
         
        
            Author : 
An, Mingyuan ; Wang, Yang ; Wang, Weiping
         
        
            Author_Institution : 
Key Lab. of Comput. Syst. & Archit., Chinese Acad. of Sci., Beijing, China
         
        
        
        
        
        
            Abstract : 
MapReduce is a programming framework introduced by Google for large-scale data processing. It is usually used in a scan-centric fashion: all the data are split into blocks, a Map is generated for each block to scan and process the data in that block, and Reduces then merge the outputs of all the Maps. When a query processes only a subset of the data selected by a predicate, this brute-force method may cause extra I/O overhead on irrelevant data, and the overhead of initiating so many Maps may be non-trivial given that the data actually of interest to the query is comparatively small in volume. We propose an approach that integrates an index into MapReduce execution: only an appropriate number of Maps are generated, and each of them accesses the data through the index. This approach incurs random I/O and remote data access, so its overall performance depends on both system parameters and query characteristics. We build a cost model for both index access execution and traditional full scan execution, which can be used to choose between the two execution modes before a query is executed. Experiments show that index access execution can greatly outperform full scan execution when the selectivity of the predicate is low, and that the cost model predicts the actual execution cost very well, so it can be used to determine the execution plan for a query.
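
The choice between the two execution modes described in the abstract can be illustrated with a toy cost comparison. The sketch below is not the cost model from the paper: every name (`ClusterParams`, `full_scan_cost`, `index_access_cost`, `choose_plan`), every formula, and every default value is an assumption made for illustration only. It assumes Maps run in waves limited by a fixed number of slots, that a full scan pays one Map startup plus one sequential block read per block, and that index access pays one Map startup plus one random read per matching record.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterParams:
    """Hypothetical system parameters; all defaults are illustrative."""
    block_size_mb: float = 64.0    # size of one data block
    seq_read_mb_s: float = 80.0    # sequential read throughput of one Map
    random_read_ms: float = 10.0   # cost of one random (possibly remote) record fetch
    map_startup_s: float = 1.0     # fixed cost of initiating one Map
    map_slots: int = 20            # number of Maps that can run at the same time


def full_scan_cost(data_mb: float, p: ClusterParams) -> float:
    """Estimated seconds for one Map per block, executed in waves."""
    n_maps = math.ceil(data_mb / p.block_size_mb)
    waves = math.ceil(n_maps / p.map_slots)
    per_map = p.map_startup_s + p.block_size_mb / p.seq_read_mb_s
    return waves * per_map


def index_access_cost(n_records: int, selectivity: float,
                      n_maps: int, p: ClusterParams) -> float:
    """Estimated seconds when n_maps Maps fetch matching records via an index."""
    matching = n_records * selectivity          # records selected by the predicate
    records_per_map = matching / n_maps
    waves = math.ceil(n_maps / p.map_slots)
    per_map = p.map_startup_s + records_per_map * p.random_read_ms / 1000.0
    return waves * per_map


def choose_plan(data_mb: float, n_records: int, selectivity: float,
                n_maps: int, p: ClusterParams) -> str:
    """Return the mode with the lower estimated cost."""
    scan = full_scan_cost(data_mb, p)
    index = index_access_cost(n_records, selectivity, n_maps, p)
    return "index" if index < scan else "full_scan"


if __name__ == "__main__":
    params = ClusterParams()
    # Highly selective predicate: few matching records.
    print(choose_plan(64_000, 100_000_000, 1e-5, 10, params))
    # Predicate that matches a tenth of the data.
    print(choose_plan(64_000, 100_000_000, 0.1, 20, params))
```

Under these assumed parameters the index plan is chosen when few records match, because random reads stay cheap relative to starting a Map for every block, while the full scan is chosen once the number of random reads grows large.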
         
        
            Keywords : 
data structures; parallel programming; Google; MapReduce framework; index; large scale data processing; random I/O; remote data access; Computer science; Costs; Delay; Energy efficiency; Energy storage; Flash memory; Indexing; Mechanical factors; Nonvolatile memory; Tree data structures; MapReduce; access methods; cost model; index;
         
        
        
        
            Conference_Title : 
2010 12th International Asia-Pacific Web Conference (APWEB)
         
        
            Conference_Location : 
Busan
         
        
            Print_ISBN : 
978-0-7695-4012-2
         
        
            Electronic_ISBN : 
978-1-4244-6600-9
         
        
        
            DOI : 
10.1109/APWeb.2010.12