• DocumentCode
    172885
  • Title
    Dependency-Aware Data Locality for MapReduce
  • Author
    Xiaoyi Fan ; Xiaoqiang Ma ; Jiangchuan Liu ; Dan Li
  • Author_Institution
    Sch. of Comput. Sci., Simon Fraser Univ., Burnaby, BC, Canada
  • fYear
    2014
  • fDate
    June 27 2014-July 2 2014
  • Firstpage
    408
  • Lastpage
    415
  • Abstract
    Recent years have witnessed the prevalence of MapReduce-based systems, e.g., Apache Hadoop, in large-scale distributed data processing. Fetching data from remote servers across multiple network switches is known to be costly. Hence, it is highly desirable to co-locate computation with data. State-of-the-art popularity-based replication achieves data locality by replicating popular files and spreading the replicas over multiple servers. While this works well for independent files, it can store highly dependent files on different servers, resulting in excessive remote data accesses and consequently prolonging job completion time. In this paper, we develop DALM (Dependency-Aware Locality for MapReduce), a novel replication strategy for general real-world input data that can be highly skewed and dependent. DALM accommodates data dependency in a data-locality framework that comprehensively weighs key factors such as popularity and storage budget. We extensively evaluate DALM through both simulations and real-world implementations, and compare it with state-of-the-art solutions, including the Hadoop system and the popularity-based Scarlett. The results show that DALM can significantly improve data locality for different inputs. For a popular iterative graph processing application on Hadoop, our prototype implementation of DALM reduces remote data access and job completion time by 34.3% and 9.4%, respectively.
  • Keywords
    data analysis; graph theory; iterative methods; Apache Hadoop; DALM; data-dependency; dependency-aware data locality for MapReduce; fetching data; iterative graph processing application; job completion time reduction; large-scale distributed data processing; popularity-based replication; remote data access reduction; remote servers; replication strategy; Clustering algorithms; Communities; Educational institutions; Partitioning algorithms; Prototypes; Servers; Social network services; Cloud Computing; Data Center; Network;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cloud Computing (CLOUD), 2014 IEEE 7th International Conference on
  • Conference_Location
    Anchorage, AK
  • Print_ISBN
    978-1-4799-5062-1
  • Type
    conf
  • DOI
    10.1109/CLOUD.2014.62
  • Filename
    6973768