DocumentCode
3079465
Title
BigDataDIRAC: Deploying Distributed Big Data Applications
Author
Fernandez, Victor ; Mendez, Victor ; Pena, Tomas F.
Author_Institution
Dept. of Particle Phys., Univ. of Santiago de Compostela, Santiago de Compostela, Spain
fYear
2015
fDate
4-7 May 2015
Firstpage
1177
Lastpage
1180
Abstract
The Distributed Infrastructure with Remote Agent Control (DIRAC) software framework allows a user community to manage computing activities in different grid and cloud environments. Many communities from several fields (LHCb, Belle II, Creatis, the DIRAC4EGI multiple-community portal, etc.) use DIRAC to run jobs in distributed environments. Google created the MapReduce programming model, which offers an efficient way of performing distributed computation over large data sets. Several enterprises provide Hadoop cloud-based resources to their users and are trying to simplify the usage of Hadoop in the cloud. Based on these two robust technologies, we have created BigDataDIRAC, a solution that gives users access to multiple Big Data resources scattered across different geographical areas, much as they access grid resources. This approach opens the possibility of offering users not only grid and cloud resources, but also Big Data resources, from the same DIRAC environment. A proof of concept is shown using three computing centers in two countries, with four Hadoop clusters. Our results demonstrate the ability of BigDataDIRAC to manage jobs driven by the location of datasets stored in the Hadoop Distributed File System (HDFS) of the distributed Hadoop clusters. DIRAC is used to monitor the execution, collect the necessary statistical data, and upload the results from the remote HDFS to the SandBox Storage machine. The tests produced the equivalent of 5 days of continuous processing.
Keywords
Big Data; cloud computing; grid computing; statistical analysis; BigDataDIRAC; Google; HDFS; Hadoop cloud based resources; Hadoop distributed clusters; Hadoop file system; MapReduce programming model; SandBox Storage machine; cloud environments; computing activities; distributed big data applications; distributed infrastructure with remote agent control software framework; grid environments; statistical data; user community; Big data; Catalogs; Computer architecture; Monitoring; Physics; Portals; Software; Big Data; Cloud Computing; DIRAC; Hadoop; Hive; MapReduce; Multi-cloud environment;
fLanguage
English
Publisher
IEEE
Conference_Titel
Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on
Conference_Location
Shenzhen
Type
conf
DOI
10.1109/CCGrid.2015.109
Filename
7152615