Title : 
Characterization of Hadoop Jobs Using Unsupervised Learning
         
        
            Author : 
Aggarwal, Sonali ; Phadke, Shashank ; Bhandarkar, Milind
         
        
        
            Date : 
Nov. 30 2010-Dec. 3 2010
         
        
        
        
            Abstract : 
The MapReduce programming paradigm and its open source implementation, Apache Hadoop, are increasingly being used for data-intensive applications in cloud computing environments. An understanding of the characteristics of workloads running in MapReduce environments benefits both cloud service providers and their users. This work characterizes Hadoop jobs running on production clusters at Yahoo! using unsupervised learning. Unsupervised clustering techniques have been applied to many important problems, ranging from social network analysis to biomedical research. We use these techniques to cluster Hadoop MapReduce jobs that have similar characteristics. The Hadoop framework generates metrics for every MapReduce job, such as the number of map and reduce tasks and the number of bytes read from and written to the local file system and HDFS. We use these metrics, together with job configuration features such as the format of the input/output files and the type of compression used, to find similarity among Hadoop jobs. We study the centroids and densities of these job clusters. We also perform a comparative analysis of the real production workload and the workload emulated by our benchmark tool, Grid Mix, by comparing the job clusters of both workloads.
         
        
            Keywords : 
Web services; cloud computing; distributed programming; public domain software; unsupervised learning; Apache Hadoop; Grid Mix; Hadoop Jobs; MapReduce programming paradigm; biomedical research; cloud service; data intensive application; open source implementation; social network analysis; Benchmark testing; Clustering algorithms; History; Measurement; Production; Radiation detectors; performance benchmark; workload characterization
         
        
        
        
            Conference_Title : 
Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on
         
        
            Conference_Location : 
Indianapolis, IN
         
        
            Print_ISBN : 
978-1-4244-9405-7
         
        
            Electronic_ISBN : 
978-0-7695-4302-4
         
        
        
            DOI : 
10.1109/CloudCom.2010.20