• DocumentCode
    42392
  • Title
    LIBRA: Lightweight Data Skew Mitigation in MapReduce

  • Author
    Qi Chen ; Jinyu Yao ; Zhen Xiao

  • Author_Institution
    Dept. of Comput. Sci., Peking Univ., Beijing, China
  • Volume
    26
  • Issue
    9
  • fYear
    2015
  • fDate
    Sept. 1, 2015
  • Firstpage
    2520
  • Lastpage
    2533
  • Abstract
    MapReduce is an effective tool for parallel data processing. One significant issue in practical MapReduce applications is data skew: the imbalance in the amount of data assigned to each task. This causes some tasks to take much longer to finish than others and can significantly impact performance. This paper presents LIBRA, a lightweight strategy to address the data skew problem among the reducers of MapReduce applications. Unlike previous work, LIBRA does not require any pre-run sampling of the input data or prevent the overlap between the map and the reduce stages. It uses an innovative sampling method which can achieve a highly accurate approximation to the distribution of the intermediate data by sampling only a small fraction of the intermediate data during the normal map processing. It allows the reduce tasks to start copying as soon as the chosen sample map tasks (only a small fraction of map tasks which are issued first) complete. It supports the split of large keys when application semantics permit and the total order of the output data. It considers the heterogeneity of the computing resources when balancing the load among the reduce tasks appropriately. LIBRA is applicable to a wide range of applications and is transparent to the users. We implement LIBRA in Hadoop and our experiments show that LIBRA has negligible overhead and can speed up the execution of some popular applications by up to a factor of 4.
  • Keywords
    data handling; parallel processing; resource allocation; sampling methods; Hadoop; LIBRA; application semantics; computing resources heterogeneity; innovative sampling method; intermediate data distribution; intermediate data sampling; lightweight data skew mitigation; load balancing; map processing; parallel data processing; Approximation methods; Delays; Distributed databases; Indexes; Parallel processing; Sampling methods; Semantics; MapReduce; data skew; partitioning; sampling;
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1045-9219
  • Type
    jour
  • DOI
    10.1109/TPDS.2014.2350972
  • Filename
    6882249
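
The abstract describes partitioning intermediate data from a sample of keys while respecting total order and reducer heterogeneity. The Python sketch below illustrates that general idea only; it is not taken from the paper and is not LIBRA's algorithm. The names `plan_partitions` and `reducer_for`, and the use of relative `capacities` as a stand-in for resource heterogeneity, are assumptions made for illustration. Splitting of large keys is not modelled.

```python
from bisect import bisect_left
from collections import Counter
from itertools import accumulate


def plan_partitions(sampled_keys, capacities):
    """Choose range boundaries from a key sample (hypothetical helper).

    sampled_keys: keys observed in the sampled intermediate data.
    capacities:   relative processing capacity of each reducer (assumed input).
    Returns len(capacities) - 1 upper-bound keys, in sorted order, so that
    reducer i receives every key <= boundaries[i] not taken by an earlier one.
    """
    counts = Counter(sampled_keys)
    total = sum(counts.values())
    cap_total = sum(capacities)
    # Cumulative load each prefix of reducers should have absorbed.
    targets = [total * c / cap_total for c in accumulate(capacities)]

    boundaries = []
    running = 0
    t = 0
    for key in sorted(counts):
        running += counts[key]
        # A very frequent key can cross several targets at once; the same
        # boundary is then repeated and some reducers get an empty range.
        while t < len(capacities) - 1 and running >= targets[t]:
            boundaries.append(key)
            t += 1
    return boundaries


def reducer_for(key, boundaries):
    """Map a key to a reducer index; key ranges stay in sorted order."""
    return bisect_left(boundaries, key)


if __name__ == "__main__":
    sample = list("aaabbbcccddd")
    bounds = plan_partitions(sample, [2, 1, 1])
    print(bounds)
    print([reducer_for(k, bounds) for k in "abcd"])
```

Because keys are assigned by contiguous range, concatenating the reducer outputs in index order keeps the output totally ordered, and a reducer with twice the capacity is given roughly twice the sampled load.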