• DocumentCode
    592811
  • Title

    RPig: A scalable framework for machine learning and advanced statistical functionalities

  • Author

    MingXue Wang ; Handurukande, S.B. ; Nassar, Mohamed

  • Author_Institution
    Network Manage. Lab., Ericsson Ireland, Ireland
  • fYear
    2012
  • fDate
    3-6 Dec. 2012
  • Firstpage
    293
  • Lastpage
    300
  • Abstract
    In many domains, such as Telecom, various scenarios necessitate the processing of large amounts of data using statistical and machine learning algorithms. A noticeable effort has been made to move data management systems into MapReduce parallel processing environments such as Hadoop and Pig. Nevertheless, these systems lack features for advanced machine learning and statistical analysis. Frameworks such as Mahout, built on top of Hadoop, support machine learning, but their implementations are at a preliminary stage. For example, Mahout does not provide Support Vector Machine (SVM) algorithms, and it is difficult to use. On the other hand, traditional statistical software tools such as R, which contain comprehensive statistical algorithms for advanced analysis, are widely used. But such software can only run on a single computer and is therefore not scalable. In this paper, we propose an integrated solution, RPig, which combines the advantages of R (machine learning and statistical analysis capabilities) with the parallel data processing capabilities of Pig. The RPig framework offers a scalable advanced data analysis solution for machine learning and statistical analysis. Analysis jobs can be easily developed with RPig scripts in high-level languages. We describe the design, the implementation, and an Eclipse-based RPigEditor for the RPig framework. Using application scenarios from the Telecom domain, we show the usage of RPig and how the framework can significantly reduce development effort. The results demonstrate the scalability of our framework and the simplicity of deploying analysis jobs.
  • Keywords
    authoring languages; data analysis; learning (artificial intelligence); parallel processing; software tools; statistical analysis; telecommunication computing; Hadoop; MapReduce parallel processing environments; RPig script; Telecom domain; advanced statistical functionality; comprehensive statistical algorithms; data management systems; eclipse-based RPigEditor; high level languages; machine learning; scalable advanced data analysis solution; scalable framework; statistical analysis; statistical software tools; Analytic; Big data; Design; MapReduce; Pig; R;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cloud Computing Technology and Science (CloudCom), 2012 IEEE 4th International Conference on
  • Conference_Location
    Taipei
  • Print_ISBN
    978-1-4673-4511-8
  • Electronic_ISBN
    978-1-4673-4509-5
  • Type

    conf

  • DOI
    10.1109/CloudCom.2012.6427480
  • Filename
    6427480