DocumentCode
3686702
Title
Parallel computation of information gain using Hadoop and MapReduce
Author
Eftim Zdravevski;Petre Lameski;Andrea Kulakov;Sonja Filiposka;Dimitar Trajanov;Boro Jakimovski
Author_Institution
Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, Macedonia
fYear
2015
Firstpage
181
Lastpage
192
Abstract
Companies now collect data at such a high rate that traditional implementations of algorithms cannot process it in reasonable time. At the same time, analysis of the available data is key to business success. In a Big Data setting, tasks such as feature selection, finding discretization thresholds for continuous data, and building decision trees are especially difficult. In this paper we discuss how a parallel implementation of the algorithm for computing information gain can address these issues. Our approach is based on writing Pig Latin scripts that are compiled into MapReduce jobs, which can then be executed on Hadoop clusters. To implement the algorithm, we first define a framework for developing arbitrary algorithms and then apply it to the task at hand. To analyze the impact of the parallelization, we processed the FedCSIS AAIA'14 dataset with the proposed implementation of information gain. In the experiments we evaluate the speedup of the parallelization compared to a one-node cluster. We also analyze how to optimally determine the number of map and reduce tasks for a given cluster. To demonstrate the portability of the implementation, we present results obtained on an on-premises cluster and on Amazon AWS clusters. Finally, we illustrate the scalability of the implementation by evaluating it on a replicated version of the same dataset that is 80 times larger than the original.
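The abstract does not reproduce the Pig Latin scripts, so the following is only a minimal single-process Python sketch of how information gain can be phrased as a map step (emit one count per feature value and class label pair) followed by a reduce step (sum the counts per key). The function names `entropy` and `information_gain`, the tuple-based record layout, and the use of base-2 logarithms are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(records, feature_index, label_index=-1):
    """Information gain of one discrete feature with respect to the class label.

    Hypothetical layout: each record is a tuple whose element at
    `feature_index` is the feature value and at `label_index` the class label.
    """
    # "Map" step: emit one ((feature value, class label), 1) pair per record.
    pairs = (((r[feature_index], r[label_index]), 1) for r in records)

    # "Reduce" step: sum the emitted counts for each key.
    joint = Counter()
    for key, one in pairs:
        joint[key] += one

    # Marginal counts derived from the joint counts.
    label_counts = Counter()
    value_counts = Counter()
    for (value, label), n in joint.items():
        label_counts[label] += n
        value_counts[value] += n

    total = sum(label_counts.values())
    h_class = entropy(label_counts.values())

    # Conditional entropy: per-value class entropy weighted by value frequency.
    h_conditional = 0.0
    for value, n_value in value_counts.items():
        per_label = [n for (v, _), n in joint.items() if v == value]
        h_conditional += (n_value / total) * entropy(per_label)

    return h_class - h_conditional
```

On a cluster, the counting step is the part that would be distributed across map and reduce tasks; the final entropy arithmetic operates only on the much smaller table of aggregated counts.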
Keywords
"Parallel processing","Servers","Machine learning algorithms","Entropy","Loading","Mathematical model","Writing"
Publisher
ieee
Conference_Titel
Computer Science and Information Systems (FedCSIS), 2015 Federated Conference on
Type
conf
DOI
10.15439/2015F89
Filename
7321440
Link To Document