DocumentCode :
659482
Title :
The BTWorld use case for big data analytics: Description, MapReduce logical workflow, and empirical evaluation
Author :
Hegeman, Tim ; Ghit, Bogdan ; Capota, M. ; Hidders, Jan ; Epema, Dick ; Iosup, Alexandru
Author_Institution :
Parallel & Distrib. Syst. Group, Delft Univ. of Technol., Delft, Netherlands
fYear :
2013
fDate :
6-9 Oct. 2013
Firstpage :
622
Lastpage :
630
Abstract :
The commoditization of big data analytics, that is, the deployment, tuning, and future development of big data processing platforms such as MapReduce, relies on a thorough understanding of relevant use cases and workloads. In this work we propose BTWorld, a use case for time-based big data analytics that is representative for processing data collected periodically from a global-scale distributed system. BTWorld enables a data-driven approach to understanding the evolution of BitTorrent, a global file-sharing network that has over 100 million users and accounts for a third of today's upstream traffic. We describe for this use case the analyst questions and the structure of a multi-terabyte data set. We design a MapReduce-based logical workflow, which includes three levels of data dependency - inter-query, inter-job, and intra-job - and a query diversity that make the BTWorld use case challenging for today's big data processing tools; the workflow can be instantiated in various ways in the MapReduce stack. Last, we instantiate this complex workflow using Pig-Hadoop-HDFS and evaluate the use case empirically. Our MapReduce use case has challenging features: small (kilobytes) to large (250 MB) data sizes per observed item, excellent (10^-6) and very poor (10^2) selectivity, and short (seconds) to long (hours) job duration.
Keywords :
Big Data; data analysis; peer-to-peer computing; query processing; BTWorld use case; BitTorrent evolution understanding; MapReduce logical workflow; MapReduce stack; MapReduce-based logical workflow; Pig-Hadoop-HDFS; big data processing platform; data dependency; data-driven approach; global file-sharing network; global-scale distributed system; interjob; interquery; intrajob; job duration; multiterabyte data set; query diversity; time-based big data analytics; upstream traffic; Data handling; Data processing; Data storage systems; Engines; Information management; Open source software; Programming;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Big Data, 2013 IEEE International Conference on
Conference_Location :
Silicon Valley, CA
Type :
conf
DOI :
10.1109/BigData.2013.6691631
Filename :
6691631