• DocumentCode
    249487
  • Title

    RABID: A Distributed Parallel R for Large Datasets

  • Author

    Hao Lin ; Shuo Yang ; Midkiff, Samuel P.

  • Author_Institution
    Electr. & Comput. Eng., Purdue Univ., West Lafayette, IN, USA
  • fYear
    2014
  • fDate
    June 27 2014-July 2 2014
  • Firstpage
    725
  • Lastpage
    732
  • Abstract
    Large-scale data mining and deep data analysis are increasingly important for both enterprise and scientific applications. Statistical languages provide rich functionality and ease of use for data analysis and modeling and have a large user base. R is one of the most widely used of these languages, but is limited to a single-threaded execution model and problem sizes that fit in a single node. This paper describes a highly parallel R system called RABID (R Analytics for BIg Data) that maintains R compatibility, leverages the MapReduce-like distributed Spark, and achieves high performance and scaling across clusters. Our experimental evaluation shows that RABID performs up to 5x faster than Hadoop and 20x faster than RHIPE on two data mining applications.
  • Keywords
    data mining; distributed processing; statistical analysis; MapReduce-like distributed Spark; RABID; data analysis; data mining; distributed parallel R; enterprise applications; large datasets; scientific applications; single-threaded execution model; statistical languages; Data structures; Distributed databases; Fault tolerance; Fault tolerant systems; Programming; Servers; Sparks; Big Data analytics; Data mining; Distributed Computing; R;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Big Data (BigData Congress), 2014 IEEE International Congress on
  • Conference_Location
    Anchorage, AK
  • Print_ISBN
    978-1-4799-5056-0
  • Type

    conf

  • DOI
    10.1109/BigData.Congress.2014.107
  • Filename
    6906850