• DocumentCode
    2766772
  • Title

    An Analysis of Traces from a Production MapReduce Cluster

  • Author

    Kavulya, Soila ; Tan, Jason ; Gandhi, Rajeev ; Narasimhan, Priya

  • Author_Institution
    Carnegie Mellon Univ., Pittsburgh, PA, USA
  • fYear
    2010
  • fDate
    17-20 May 2010
  • Firstpage
    94
  • Lastpage
    103
  • Abstract
    MapReduce is a programming paradigm for parallel processing that is increasingly being used for data-intensive applications in cloud computing environments. An understanding of the characteristics of workloads running in MapReduce environments benefits both the service providers in the cloud and users: the service provider can use this knowledge to make better scheduling decisions, while the user can learn what aspects of their jobs impact performance. This paper analyzes 10 months of MapReduce logs from the M45 supercomputing cluster, which Yahoo! made freely available to select universities for academic research. We characterize resource utilization patterns, job patterns, and sources of failures. We use an instance-based learning technique that exploits temporal locality to predict job completion times from historical data and identify potential performance problems in our dataset.
  • Keywords
    parallel processing; scheduling; M45 supercomputing cluster; Yahoo!; cloud computing environments; instance-based learning technique; production MapReduce cluster; Cloud computing; Costs; Data mining; Grid computing; Large-scale systems; Parallel programming; Performance analysis; Processor scheduling; Production; Distributed systems; MapReduce; Workload characterization
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on
  • Conference_Location
    Melbourne, VIC
  • Print_ISBN
    978-1-4244-6987-1
  • Type

    conf

  • DOI
    10.1109/CCGRID.2010.112
  • Filename
    5493490