DocumentCode
249487
Title
RABID: A Distributed Parallel R for Large Datasets
Author
Hao Lin ; Shuo Yang ; Midkiff, Samuel P.
Author_Institution
Electr. & Comput. Eng., Purdue Univ., West Lafayette, IN, USA
fYear
2014
fDate
June 27 2014-July 2 2014
Firstpage
725
Lastpage
732
Abstract
Large-scale data mining and deep data analysis are increasingly important for both enterprise and scientific applications. Statistical languages provide rich functionality and ease of use for data analysis and modeling, and they have a large user base. R is one of the most widely used of these languages, but it is limited to a single-threaded execution model and to problem sizes that fit in a single node. This paper describes a highly parallel R system called RABID (R Analytics for BIg Data) that maintains R compatibility, leverages the MapReduce-like distributed Spark, and achieves high performance and scaling across clusters. Our experimental evaluation shows that RABID performs up to 5x faster than Hadoop and 20x faster than RHIPE on two data mining applications.
Keywords
data mining; distributed processing; statistical analysis; MapReduce-like distributed Spark; RABID; data analysis; data mining; distributed parallel R; enterprise applications; large datasets; scientific applications; single-threaded execution model; statistical languages; Data structures; Distributed databases; Fault tolerance; Fault tolerant systems; Programming; Servers; Sparks; Big Data analytics; Data mining; Distributed Computing; R
fLanguage
English
Publisher
IEEE
Conference_Titel
Big Data (BigData Congress), 2014 IEEE International Congress on
Conference_Location
Anchorage, AK
Print_ISBN
978-1-4799-5056-0
Type
conf
DOI
10.1109/BigData.Congress.2014.107
Filename
6906850