• DocumentCode
    2447447
  • Title

    Characterization of Hadoop Jobs Using Unsupervised Learning

  • Author

    Aggarwal, Sonali ; Phadke, Shashank ; Bhandarkar, Milind

  • fYear
    2010
  • fDate
    Nov. 30 2010-Dec. 3 2010
  • Firstpage
    748
  • Lastpage
    753
  • Abstract
    The MapReduce programming paradigm and its open source implementation, Apache Hadoop, are increasingly being used for data-intensive applications in cloud computing environments. An understanding of the characteristics of workloads running in MapReduce environments benefits both cloud service providers and their users. This work characterizes Hadoop jobs running on production clusters at Yahoo! using unsupervised learning. Unsupervised clustering techniques have been applied to many important problems, ranging from social network analysis to biomedical research. We use these techniques to cluster Hadoop MapReduce jobs that are similar in characteristics. The Hadoop framework generates metrics for every MapReduce job, such as the number of map and reduce tasks and the number of bytes read from and written to the local file system and HDFS. We use these metrics, together with job configuration features such as the format of the input/output files and the type of compression used, to find similarity among Hadoop jobs. We study the centroids and densities of these job clusters. We also perform a comparative analysis of the real production workload and the workload emulated by our benchmark tool, Grid Mix, by comparing job clusters of both workloads.
  • Keywords
    Web services; cloud computing; distributed programming; public domain software; unsupervised learning; Apache Hadoop; Grid Mix; Hadoop Jobs; MapReduce programming paradigm; biomedical research; cloud service; data intensive application; open source implementation; social network analysis; Benchmark testing; Clustering algorithms; History; Measurement; Production; Radiation detectors; performance benchmark; workload characterization
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on
  • Conference_Location
    Indianapolis, IN
  • Print_ISBN
    978-1-4244-9405-7
  • Electronic_ISBN
    978-0-7695-4302-4
  • Type

    conf

  • DOI
    10.1109/CloudCom.2010.20
  • Filename
    5708526