Title : 
Extending Map-Reduce for Efficient Predicate-Based Sampling
         
        
            Author : 
Grover, Raman ; Carey, Michael J.
         
        
            Author_Institution : 
Dept. of Comput. Sci., Univ. of California, Irvine, CA, USA
         
        
        
        
        
        
            Abstract : 
In this paper, we address the problem of using MapReduce to sample a massive data set to produce a fixed-size sample whose contents satisfy a given predicate. While this computation is simple to express in MapReduce, its default Hadoop execution depends on the input size and wastes cluster resources. This is unfortunate, as sampling queries are fairly common (e.g., for exploratory data analysis at Facebook), and the resulting waste can significantly impact the performance of a shared cluster. To address such use cases, we present the design, implementation, and evaluation of a Hadoop execution model extension that supports incremental job expansion. Under this model, a job consumes input as required and can dynamically govern its resource consumption while producing the required results. The proposed mechanism supports a variety of policies regarding job growth rates as they relate to cluster capacity and current load. We have implemented the mechanism in Hadoop, and we present results from an experimental performance study of different job growth policies under both single- and multi-user workloads.
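
The idea of incremental job expansion described in the abstract can be illustrated with a minimal, single-process sketch. This is not the authors' Hadoop implementation: the function name `incremental_sample`, the `grow` policy hook, and the toy input splits are all hypothetical, and the sketch simply keeps the first k matching records rather than modeling Hadoop scheduling or cluster load.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def incremental_sample(
    splits: Sequence[Sequence[T]],
    predicate: Callable[[T], bool],
    k: int,
    grow: Callable[[int, int], int],
) -> Tuple[List[T], int]:
    """Collect up to k records satisfying `predicate`, reading input in rounds.

    `grow(processed, remaining)` is a hypothetical growth-policy hook that
    returns how many additional splits to consume in the next round.
    Returns the sample and the number of splits actually read.
    """
    sample: List[T] = []
    processed = 0
    while len(sample) < k and processed < len(splits):
        remaining = len(splits) - processed
        # Clamp the policy's answer to the range [1, remaining].
        step = max(1, min(grow(processed, remaining), remaining))
        for split in splits[processed:processed + step]:
            # Stand-in for a map task: filter one split by the predicate.
            sample.extend(r for r in split if predicate(r))
        processed += step
    return sample[:k], processed


# Toy input: 10 splits of 100 integers each (an assumption for illustration).
splits = [list(range(i * 100, (i + 1) * 100)) for i in range(10)]
is_multiple_of_7 = lambda x: x % 7 == 0

# Conservative policy: add one split per round.
sample, used = incremental_sample(
    splits, is_multiple_of_7, k=20, grow=lambda done, left: 1
)
print(len(sample), used)
```

In this sketch, a policy that returns `left` reproduces the default behavior of scanning the whole input, while smaller return values trade extra rounds for less input read; the paper's actual policies additionally take cluster capacity and current load into account.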
         
        
            Keywords : 
data handling; Hadoop execution model extension; MapReduce; cluster capacity; cluster resource; fixed-size sample; incremental job expansion; job growth policy; massive data sampling; multiuser workload; predicate-based sampling; resource consumption; single-user workload; Availability; Delay; Facebook; Indexes; Load modeling; Runtime; Time factors
         
        
        
        
            Conference_Title : 
Data Engineering (ICDE), 2012 IEEE 28th International Conference on
         
        
            Conference_Location : 
Washington, DC
         
        
        
            Print_ISBN : 
978-1-4673-0042-1
         
        
        
            DOI : 
10.1109/ICDE.2012.104