• DocumentCode
    875790
  • Title

    A parallel sort merge join algorithm for managing data skew

  • Author

    Wolf, Joel L. ; Dias, Daniel M. ; Yu, Philip S.

  • Author_Institution
    IBM T. J. Watson Res. Center. Yorktown Heights, NY, USA
  • Volume
    4
  • Issue
    1
  • fYear
    1993
  • fDate
    1/1/1993 12:00:00 AM
  • Firstpage
    70
  • Lastpage
    86
  • Abstract
    A parallel sort-merge-join algorithm which uses a divide-and-conquer approach to address the data skew problem is proposed. The proposed algorithm adds an extra, low-cost scheduling phase to the usual sort, transfer, and join phases. During the scheduling phase, a parallelizable optimization algorithm, using the output of the sort phase, attempts to balance the load across the multiple processors in the subsequent join phase. The algorithm naturally identifies the largest skew elements, and assigns each of them to an optimal number of processors. Assuming a Zipf-like distribution of data skew, the algorithm is demonstrated to achieve very good load balancing for the join phase, and is shown to be very robust relative, among other things, to the degree of data skew and the total number of processors
  • Keywords
    distributed databases; merging; parallel algorithms; relational algebra; relational databases; sorting; Zipf-like distribution; data skew management; divide-and-conquer; join phases; load balancing; multiple processors; parallel sort merge join algorithm; parallelizable optimization algorithm; scheduling phase; sort phase; transfer phase; Costs; Delay; Load management; Parallel architectures; Parallel processing; Processor scheduling; Proposals; Relational databases; Robustness; Scheduling algorithm;
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    ieee
  • ISSN
    1045-9219
  • Type

    jour

  • DOI
    10.1109/71.205654
  • Filename
    205654