Author:
Elnikety, Eslam ; Elsayed, Tamer ; Ramadan, Hany E.
Abstract:
MapReduce is a distributed programming framework designed to ease the development of scalable data-intensive applications for large clusters of commodity machines. Most machine learning and data mining applications involve iterative computations over large datasets, such as Web hyperlink structures and social network graphs. Yet the MapReduce model does not efficiently support this important class of applications. The architecture of MapReduce, most critically its dataflow techniques and task scheduling, is completely unaware of the nature of iterative applications; tasks are scheduled according to a policy that optimizes the execution of a single iteration, which wastes bandwidth, I/O, and CPU cycles when compared with an optimal execution for a consecutive set of iterations. This work presents iHadoop, a modified MapReduce model, and an associated implementation, optimized for iterative computations. The iHadoop model schedules iterations asynchronously. It connects the output of one iteration to the next, allowing both to process their data concurrently. iHadoop's task scheduler exploits inter-iteration data locality by scheduling tasks that exhibit a producer/consumer relation on the same physical machine, allowing fast local data transfer. For those iterative applications that must satisfy certain criteria before termination, iHadoop runs the check concurrently with the execution of the subsequent iteration to further reduce the application's latency. This paper also describes our implementation of the iHadoop model and evaluates its performance against Hadoop, the widely used open-source implementation of MapReduce. Experiments using different data analysis applications over real-world and synthetic datasets show that iHadoop performs better than Hadoop for iterative algorithms, reducing the execution time of iterative applications by 25% on average.
Furthermore, integrating iHadoop with HaLoop, a variant Hadoop implementation that caches invariant data between iterations, reduces execution time by 38% on average.
Keywords:
data analysis; data flow analysis; data mining; distributed programming; electronic data interchange; graph theory; hypermedia; iterative methods; learning (artificial intelligence); scheduling; social networking (online); software performance evaluation; very large databases; CPU cycles; Hadoop implementation; I/O cycles; MapReduce model; Web hyperlink structures; asynchronous iterations; bandwidth cycles; commodity machines; data analysis applications; data mining applications; data transfer; dataflow techniques; distributed programming framework; iHadoop model; inter-iteration data locality; iterative algorithms; iterative computations; large datasets; machine learning; open source implementation; optimal execution; performance evaluation; scalable data-intensive applications; single iteration; social network graphs; synthetic datasets; task scheduling; Computational modeling; Fault tolerance; Fault tolerant systems; Iterative methods; Load management; Programming; Schedules; Asynchronous; Cluster; Iterative Algorithms; MapReduce; Parallel Data Processing