Title :
Pythia: Faster Big Data in Motion through Predictive Software-Defined Network Optimization at Runtime
Author :
Veiga Neves, Marcelo ; De Rose, C.A.F. ; Katrinis, K. ; Franke, Hubertus
Author_Institution :
Pontifical Catholic Univ. of Rio Grande do Sul, Porto Alegre, Brazil
Abstract :
The rise of Internet of Things sensors, social networking and mobile devices has led to an explosion of available data. Gaining insights into this data has led to the area of Big Data analytics. The MapReduce framework, as implemented in Hadoop, is one of the most popular frameworks for Big Data analysis. To handle the ever-increasing data size, Hadoop provides a scalable framework that allows dedicated, seemingly unbounded numbers of servers to participate in the analytics process. The response time of an analytics request is an important factor for time to value/insights. While the compute and disk I/O requirements can be scaled with the number of servers, scaling the system leads to increased network traffic. Arguably, the communication-heavy phase of MapReduce contributes significantly to the overall response time. The problem is further aggravated if communication patterns are heavily skewed, as is not uncommon in many MapReduce workloads. In this paper we present a system that reduces the skew impact by transparently predicting data communication volume at runtime and mapping the many end-to-end flows among the various processes to the underlying network, using emerging software-defined networking technologies to avoid hotspots in the network. Depending on the network oversubscription ratio, we demonstrate a reduction in job completion time between 3% and 46% for popular MapReduce benchmarks like Sort and Nutch.
Keywords :
Big Data; computer networks; parallel programming; telecommunication traffic; Big Data analytics; Hadoop; MapReduce workloads; Nutch MapReduce benchmark; Pythia; Sort MapReduce benchmark; communication patterns; communication-heavy phase; compute requirements; data communication volume prediction; data size; disk I/O requirements; end-to-end flow mapping; job completion time reduction; network oversubscription ratio; network traffic; predictive software-defined network optimization; response time; runtime analysis; scalable framework; system scaling; unbound server numbers; Big data; Instruments; Job shop scheduling; Resource management; Routing; Runtime; Servers; Data communication; Data processing; Distributed computing;
Conference_Title :
Parallel and Distributed Processing Symposium, 2014 IEEE 28th International
Conference_Location :
Phoenix, AZ
Print_ISBN :
978-1-4799-3799-8
DOI :
10.1109/IPDPS.2014.20