A Parallel Distributed Weka Framework for Big Data Mining Using Spark

Author

Koliopoulos, Aris-Kyriakos ; Yiapanis, Paraskevas ; Tekiner, Firat ; Nenadic, Goran ; Keane, John

Author_Institution

Sch. of Comput. Sci., Univ. of Manchester, Manchester, UK

fYear

2015

Firstpage

9

Lastpage

16

Abstract

Effective Big Data Mining requires scalable and efficient solutions that are also accessible to users of all levels of expertise. Despite this, many current efforts to provide effective knowledge extraction via large-scale Big Data Mining tools focus more on performance than on use and tuning, which are complex problems even for experts. Weka is a popular and comprehensive Data Mining workbench with a well-known and intuitive interface; nonetheless, it supports only sequential single-node execution. Hence, the size of the datasets and processing tasks that Weka can handle within its existing environment is limited both by the amount of memory in a single node and by sequential execution. This work discusses DistributedWekaSpark, a distributed framework for Weka which maintains its existing user interface. The framework is implemented on top of Spark, a Hadoop-related distributed framework with fast in-memory processing capabilities and support for iterative computations. By combining Weka's usability and Spark's processing power, DistributedWekaSpark provides a usable prototype distributed Big Data Mining workbench that achieves near-linear scaling in executing various real-world-scale workloads: 91.4% weak scaling efficiency on average, and up to 4x faster on average than Hadoop.

Keywords

Big Data; data mining; parallel processing; user interfaces; DistributedWekaSpark; Hadoop-related distributed framework; Spark processing power; Weka usability; datasets size; distributed Big Data mining; fast in-memory processing capabilities; iterative computations; knowledge extraction; large-scale Big Data mining tools; parallel distributed Weka framework; processing tasks; sequential single-node execution; user interface; Algorithm design and analysis; Big data; Computational modeling; Data mining; Load modeling; Object oriented modeling; Sparks; Big Data; Data Mining; Distributed Systems; Machine Learning; Spark; Weka;

fLanguage

English

Publisher

IEEE

Conference_Titel

Big Data (BigData Congress), 2015 IEEE International Congress on

Conference_Location

New York, NY

Print_ISBN

978-1-4673-7277-0

Type

conf

DOI

10.1109/BigDataCongress.2015.12

Filename

7207196