DocumentCode :
1447074
Title :
Optimizing Multiway Joins in a Map-Reduce Environment
Author :
Afrati, Foto N. ; Ullman, Jeffrey D.
Author_Institution :
Sch. of Electr. & Comput. Eng., Nat. Tech. Univ. of Athens, Athens, Greece
Volume :
23
Issue :
9
fYear :
2011
Firstpage :
1282
Lastpage :
1298
Abstract :
Implementations of map-reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the map-reduce environment. Our new approach begins by identifying the “map-key,” the set of attributes that identify the Reduce process to which a Map process must send a particular tuple. Each attribute of the map-key gets a “share,” which is the number of buckets into which its values are hashed, to form a component of the identifier of a Reduce process. Relations have their tuples replicated in limited fashion, the degree of replication depending on the shares for those map-key attributes that are missing from their schema. We study the problem of optimizing the shares, given a fixed number of Reduce processes. An algorithm for detecting and fixing problems where a variable is mistakenly included in the map-key is given. Then, we consider two important special cases: chain joins and star joins. In each case, we are able to determine the map-key and determine the shares that yield the least replication. While the method we propose is not always superior to the conventional way of using map-reduce to implement joins, there are some important cases involving large-scale data where our method wins, including: 1) analytic queries in which a very large fact table is joined with smaller dimension tables, and 2) queries involving paths through graphs with high out-degree, such as the Web or a social network.
Keywords :
Internet; parallel processing; query processing; social networking (online); Web; analytic queries; map-key attributes; map-reduce environment; multiway joins optimization; parallel computing; query optimization; reduce process; social network; Clustering algorithms; Electronic mail; Equations; Optimization; Process control; Program processors; Map-reduce; joins; parallel computing; query optimization.;
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
ieee
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2011.47
Filename :
5710932
Link To Document :
بازگشت