• DocumentCode
    988324
  • Title

    A parallel hash join algorithm for managing data skew

  • Author

    Wolf, Joel L. ; Yu, Philip S. ; Turek, John ; Dias, Daniel M.

  • Author_Institution
    IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA
  • Volume
    4
  • Issue
    12
  • fYear
    1993
  • fDate
    12/1/1993 12:00:00 AM
  • Firstpage
    1355
  • Lastpage
    1371
  • Abstract
    Presents a parallel hash join algorithm that is based on the concept of hierarchical hashing, to address the problem of data skew. The proposed algorithm splits the usual hash phase into a hash phase and an explicit transfer phase, and adds an extra scheduling phase between these two. During the scheduling phase, a heuristic optimization algorithm, using the output of the hash phase, attempts to balance the load across the multiple processors in the subsequent join phase. The algorithm naturally identifies the hash partitions with the largest skew values and splits them as necessary, assigning each of them to an optimal number of processors. Assuming for concreteness a Zipf-like distribution of the values in the join column, a join phase which is CPU-bound, and a shared nothing environment, the algorithm is shown to achieve good join phase load balancing, and to be robust relative to the degree of data skew and the total number of processors. The overall speedup due to this algorithm is compared to some existing parallel hash join methods. The proposed method does considerably better in high skew situations
  • Keywords
    database theory; parallel algorithms; query processing; relational databases; resource allocation; Zipf-like distribution; combinatorial optimization; complex queries; data skew; hash joins; heuristic optimization; hierarchical hashing; join column; load balancing; parallel hash join algorithm; relational databases; scheduling; Delay; Heuristic algorithms; Load management; Parallel processing; Partitioning algorithms; Processor scheduling; Proposals; Relational databases; Robustness; Scheduling algorithm;
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    ieee
  • ISSN
    1045-9219
  • Type

    jour

  • DOI
    10.1109/71.250117
  • Filename
    250117