DocumentCode :
3686926
Title :
A Scalable Data Science Workflow Approach for Big Data Bayesian Network Learning
Author :
Jianwu Wang;Yan Tang;Mai Nguyen;Ilkay Altintas
Author_Institution :
San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA, USA
fYear :
2014
Firstpage :
16
Lastpage :
25
Abstract :
In the Big Data era, machine learning has more potential to discover valuable insights from data. As an important machine learning technique, the Bayesian Network (BN) has been widely used to model probabilistic relationships among variables. To deal with the challenges of Big Data BN learning, we apply techniques from distributed data-parallelism (DDP) and scientific workflows to the BN learning process. We first propose an intelligent Big Data pre-processing approach and a data quality score to measure and ensure data quality and data faithfulness. Then, a new weight-based ensemble algorithm is proposed to learn a BN structure from an ensemble of local results. To easily integrate the algorithm with DDP engines, such as Hadoop, we employ the Kepler scientific workflow system to build the whole learning process. We demonstrate how Kepler can facilitate building and running our Big Data BN learning application. Our experiments show good scalability and learning accuracy when running the application in real distributed environments.
Keywords :
"Big data","Engines","Bayes methods","Partitioning algorithms","Accuracy","Algorithm design and analysis","Distributed databases"
Publisher :
IEEE
Conference_Title :
Big Data Computing (BDC), 2014 IEEE/ACM International Symposium on
Type :
conf
DOI :
10.1109/BDC.2014.10
Filename :
7321725