• DocumentCode
    3145264
  • Title
    RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems
  • Author
    He, Yongqiang ; Lee, Rubao ; Huai, Yin ; Shao, Zheng ; Jain, Namit ; Zhang, Xiaodong ; Xu, Zhiwei
  • fYear
    2011
  • fDate
    11-16 April 2011
  • Firstpage
    1199
  • Lastpage
    1208
  • Abstract
    MapReduce-based data warehouse systems are playing important roles of supporting big data analytics to understand quickly the dynamics of user behavior trends and their needs in typical Web service providers and social network sites (e.g., Facebook). In such a system, the data placement structure is a critical factor that can affect the warehouse performance in a fundamental way. Based on our observations and analysis of Facebook production systems, we have characterized four requirements for the data placement structure: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) strong adaptivity to highly dynamic workload patterns. We have examined three commonly accepted data placement structures in conventional databases, namely row-stores, column-stores, and hybrid-stores in the context of large data analysis using MapReduce. We show that they are not very suitable for big data processing in distributed systems. In this paper, we present a big data placement structure called RCFile (Record Columnar File) and its implementation in the Hadoop system. With intensive experiments, we show the effectiveness of RCFile in satisfying the four requirements. RCFile has been chosen in Facebook data warehouse system as the default option. It has also been adopted by Hive and Pig, the two most widely used data analysis systems developed in Facebook and Yahoo!
  • Keywords
    Web services; data analysis; data structures; data warehouses; query processing; social networking (online); Facebook production systems; Hadoop system; MapReduce-based warehouse systems; RCFile; Web service providers; Yahoo!; data analytics; data placement structure; data processing; distributed systems; fast data loading; fast query processing; large data analysis; record columnar file; social network sites; storage space utilization; user behavior trends; Data compression; Data handling; Data storage systems; Data warehouses; Facebook; Information management; Query processing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering (ICDE), 2011 IEEE 27th International Conference on
  • Conference_Location
    Hannover
  • ISSN
    1063-6382
  • Print_ISBN
    978-1-4244-8959-6
  • Electronic_ISBN
    1063-6382
  • Type
    conf
  • DOI
    10.1109/ICDE.2011.5767933
  • Filename
    5767933