Comparison of a sequential and a MapReduce approach to joining large datasets

Author

Lalic, Marko ; Memic, Emina ; Kesan, Faruk ; Gondzic, Edita ; Smajic, Nermin ; Nosovic, Novica

Author_Institution

Dept. for Comput. & Inf., Univ. of Sarajevo, Sarajevo, Bosnia-Herzegovina

fYear

2013

fDate

20-24 May 2013

Firstpage

1289

Lastpage

1291

Abstract

MapReduce as a programming model is considered one of the biggest improvements in massive data processing that utilizes parallelization. The increasing amount of data being processed and stored has created a need to investigate more efficient solutions to common problems, one of which is performing a join operation on two interconnected datasets. In this paper, a classic sequential solution to this problem is compared with a MapReduce approach, with the intent of discovering the relative advantages of the two. The runtime of the sequential application is proven to be prohibitively slow even for datasets that are negligible in size by today's standards. Furthermore, a MapReduce cluster of five Amazon EC2 nodes is shown to process ten times as much data as the sequential application in the same time period.

Keywords

data analysis; pattern clustering; programming; very large databases; Amazon EC2 nodes; MapReduce cluster; interconnected datasets; large datasets; massive data processing; programming model; sequential application runtime; Clustering algorithms; Computational modeling; Data processing; Distributed databases; Educational institutions; Facebook; Programming; Hadoop; MapReduce; cluster; distributed join; join

fLanguage

English

Publisher

IEEE

Conference_Titel

Information & Communication Technology Electronics & Microelectronics (MIPRO), 2013 36th International Convention on

Conference_Location

Opatija

Print_ISBN

978-953-233-076-2

Type

conf

Filename

6596457