• DocumentCode
    2432506
  • Title

    Data engineering approach to efficient data warehouse: Life cycle development revisited

  • Author

    Daneshpour, Negin ; Barfourosh, Ahmad Abdollahzadeh

  • Author_Institution
    Dept. of Comput. Eng. & Inf. Technol., Amirkabir Univ. of Technol., Tehran, Iran
  • fYear
    2011
  • fDate
    15-16 June 2011
  • Firstpage
    109
  • Lastpage
    120
  • Abstract
    A data warehouse (DW) refers to technologies for collecting, integrating, and analyzing large volumes of homogeneous and heterogeneous data to provide information that enables better decision making. To achieve the main purpose of a data warehouse, namely presenting analytical responses to online queries, many parameters must be considered in the development life cycle. Among all the factors involved in DW efficiency, data quality deserves the most attention. Today, a data warehouse architecture typically consists of several components that consolidate data from several operational and historical databases to support a variety of front-end query, reporting, and analytical tools. The back end of the architecture relies mainly on the Extract-Transform-Load (ETL) process, which is usually preferred as a tool. Designing and implementing an application-dependent ETL process to pipeline validated and verified data is labor intensive and typically consumes a large fraction of the effort in data warehouse projects. The outcome of our experiment, in which a DW was built following the recommended methodology on thirty-three million actual population records, confirms that the DW development life cycle has to be revisited. Many works have reported on the impact of data quality on DW efficiency, but less attention has been paid to the data engineering aspects of revising the development life cycle to obtain an efficient DW. Our investigation in the latest experiment shows that the following three steps facilitate the life cycle process and yield a more tailored DW: 1) data cleaning as a pre-processing phase before data cleansing in ETL; 2) identifying query types and their operations before the transform phase of ETL; 3) identifying and materializing a suitable view for each query before the load phase of ETL. The results with respect to accuracy, effort, and time have been tested and are promising.
  • Keywords
    data warehouses; query processing; data cleaning; data engineering approach; data quality impact; data warehouse architecture; extract-transform-load process; front-end query reporting; life cycle development; Business; Classification algorithms; Cleaning; Databases; Heuristic algorithms; Time factors; OLAP; data engineering; data warehouse development life cycle; query type classification; view materialization
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Science and Software Engineering (CSSE), 2011 CSI International Symposium on
  • Conference_Location
    Tehran
  • Print_ISBN
    978-1-61284-206-6
  • Type

    conf

  • DOI
    10.1109/CSICSSE.2011.5963983
  • Filename
    5963983
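
The three steps listed in the abstract (pre-cleaning, query type identification, view materialization) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record fields (`id`, `city`, `age`), the `person` table, the keyword-based query classifier, and the function names are all hypothetical, and because SQLite has no native materialized views, a summary table created with `CREATE TABLE ... AS SELECT` stands in for one.

```python
import sqlite3


def preclean(records):
    """Step 1 (assumed rules): drop incomplete or invalid rows, normalize text, remove duplicates."""
    seen, cleaned = set(), []
    for rec in records:
        city = (rec.get("city") or "").strip().title()
        age = rec.get("age")
        if not city or not isinstance(age, int) or age < 0:
            continue  # missing city or implausible age
        key = (rec.get("id"), city, age)
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        cleaned.append({"id": rec.get("id"), "city": city, "age": age})
    return cleaned


def classify_query(sql):
    """Step 2 (assumed heuristic): label a query by its dominant operation."""
    text = " ".join(sql.upper().split())
    if "GROUP BY" in text:
        return "aggregate"
    if " WHERE " in text:
        return "selection"
    return "scan"


def materialize_views(conn, records, queries):
    """Step 3 (stand-in): store cleaned rows, then precompute one summary table per aggregate query."""
    conn.execute("CREATE TABLE person (id INTEGER, city TEXT, age INTEGER)")
    conn.executemany("INSERT INTO person VALUES (:id, :city, :age)", records)
    views = {}
    for i, sql in enumerate(queries):
        if classify_query(sql) == "aggregate":
            name = f"mv_{i}"  # hypothetical naming scheme
            conn.execute(f"CREATE TABLE {name} AS {sql}")
            views[sql] = name
    return views


def run_pipeline(raw_records, queries):
    """Chain the three steps against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    views = materialize_views(conn, preclean(raw_records), queries)
    return conn, views
```

In this sketch only aggregate queries receive a precomputed table; which query types actually benefit from materialization would depend on the workload, which the abstract does not specify.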