Title :
Dependency-Aware Data Locality for MapReduce
Author :
Xiaoyi Fan ; Xiaoqiang Ma ; Jiangchuan Liu ; Dan Li
Author_Institution :
Sch. of Comput. Sci., Simon Fraser Univ., Burnaby, BC, Canada
fDate :
June 27 2014-July 2 2014
Abstract :
Recent years have witnessed the prevalence of MapReduce-based systems, e.g., Apache Hadoop, in large-scale distributed data processing. Fetching data from remote servers across multiple network switches is known to be costly. Hence, it is highly desirable to co-locate computation with data. State-of-the-art popularity-based replication achieves data locality by replicating popular files and spreading the replicas over multiple servers. While it works well for independent files, it can store highly dependent files on different servers, resulting in excessive remote data accesses and consequently prolonging the job completion time. In this paper, we develop DALM (Dependency-Aware Locality for MapReduce), a novel replication strategy for general real-world input data that can be highly skewed and dependent. DALM accommodates data dependency in a data-locality framework that comprehensively weighs such key factors as popularity and storage budget. We extensively evaluate DALM through both simulations and real-world implementations, and compare it with state-of-the-art solutions, including the Hadoop system and the popularity-based Scarlett. The results show that DALM can significantly improve data locality for different inputs. For a popular iterative graph processing application on Hadoop, our prototype implementation of DALM reduces the remote data access and job completion time by 34.3% and 9.4%, respectively.
Keywords :
data analysis; graph theory; iterative methods; Apache Hadoop; DALM; data-dependency; dependency-aware data locality for MapReduce; fetching data; iterative graph processing application; job completion time reduction; large-scale distributed data processing; popularity-based replication; remote data access reduction; remote servers; replication strategy; Clustering algorithms; Communities; Educational institutions; Partitioning algorithms; Prototypes; Servers; Social network services; Cloud Computing; Data Center; Network;
Conference_Title :
Cloud Computing (CLOUD), 2014 IEEE 7th International Conference on
Conference_Location :
Anchorage, AK
Print_ISBN :
978-1-4799-5062-1
DOI :
10.1109/CLOUD.2014.62