DocumentCode :
2866303
Title :
Dacoop: Accelerating Data-Iterative Applications on Map/Reduce Cluster
Author :
Liang, Yi ; Li, Guangrui ; Wang, Lei ; Hu, Yanpeng
Author_Institution :
Dept. of Comput. Sci., Beijing Univ. of Technol., Beijing, China
fYear :
2011
fDate :
20-22 Oct. 2011
Firstpage :
207
Lastpage :
214
Abstract :
Map/reduce is a popular parallel processing framework for massive-scale data-intensive computing. A data-iterative application is composed of a series of map/reduce jobs and needs to repeatedly process certain data files across these jobs. Existing implementations of the map/reduce framework focus on performing data processing in a single pass with one map/reduce job and do not directly support data-iterative applications, particularly in terms of explicitly specifying the repeatedly processed data shared among jobs. In this paper, we propose an extended version of the Hadoop map/reduce framework called Dacoop. Dacoop extends the map/reduce programming interface to specify the repeatedly processed data, introduces a shared-memory-based data cache mechanism to cache such data upon its first access, and adopts caching-aware task scheduling so that the cached data can be shared among the map/reduce jobs of data-iterative applications. We evaluate Dacoop on two typical data-iterative applications, k-means clustering and domain rule reasoning in the semantic web, using both real and synthetic datasets. Experimental results show that data-iterative applications achieve better performance on Dacoop than on Hadoop, reducing the turnaround time of a data-iterative application by up to 15.1%.
Keywords :
cache storage; iterative methods; parallel processing; pattern clustering; semantic Web; shared memory systems; task analysis; Dacoop; Hadoop MapReduce framework; MapReduce cluster; MapReduce programming interface; caching-aware task scheduling; data file; data-iterative application; data processing; domain rule reasoning; k-means clustering; massive-scale data-intensive computing; parallel processing framework; real dataset; repeatedly processed data; semantic Web; shared-memory-based data cache mechanism; synthetic dataset; Computer architecture; Data models; Distributed databases; Load modeling; Processor scheduling; Programming; Scheduling; data cache; data-iterative application; map/reduce; shared memory; task scheduling
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2011 12th International Conference on
Conference_Location :
Gwangju
Print_ISBN :
978-1-4577-1807-6
Type :
conf
DOI :
10.1109/PDCAT.2011.32
Filename :
6118932